Coyotos Microkernel Specification

Version 0.6

September 10, 2007

Jonathan S. Shapiro, Ph.D., Jonathan W. Adams

The EROS Group, LLC

Legal Notice

THIS SPECIFICATION IS PROVIDED ``AS IS'' WITHOUT ANY WARRANTIES, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NON-INFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OF ANY PROPOSAL, SPECIFICATION OR SAMPLE.

Table of Contents

Acknowledgments

Preface

The Original Plan

Overrun by the Hurd

Coyotos Today

1 Overview

1.1 Microkernel Objects

1.2 Entry Capabilities and Extensibility

1.3 Checkpointing and Persistence

1.4 Process States and Exceptions

1.5 Messages

1.6 Naming and Invocation

1.7 Exception and Interrupt Handling

1.8 Protection Model

I Microkernel Abstractions

2 Capabilities

2.1 Representation

2.1.1 Capabilities to Memory Objects

2.1.2 Message-Related Capabilities

2.1.3 Capabilities to Processes

2.1.4 Miscellaneous Capabilities

2.2 Valid Capabilities

2.3 Capability Prepare

2.4 Extensibility

3 Processes

3.1 State of a Process

3.1.1 Per-process State

3.2 Execution Model

3.3 Exception Handling

3.3.1 Exception Delivery

3.3.2 Return From Out-of-Process Handler

3.4 Application-Defined Notifications

4 Address Spaces

4.1 Memory Objects and Address Interpretation

4.1.1 Permissions

4.1.2 References and Access Violations

4.2 Pages and Capability Pages

4.3 Address Space Composition

4.3.1 Translation Algorithm

4.3.2 Exception Handling

4.3.3 Cycle Detection

4.4 Address Space Splitting

5 Capability Invocation (including IPC)

5.1 Invocation Payload

5.2 Invocation-Related Exceptions

5.3 Endpoints

5.4 Semantics of Kernel Capability Invocation

6 System Calls

6.1 Parameters and Parameter Words

6.2 Exceptions

6.3 Capability Locations

6.4 Pseudo-Instructions

6.4.1 Yield [syscall]

6.4.2 CopyCap [syscall]

6.5 InvokeCap [syscall]

6.5.1 Arguments

6.5.1.1 Conventions

6.5.1.2 Kernel Invocation Conventions

6.5.2 Return Values

7 Schedules

7.1 Scheduling Model

8 Other Kernel Objects

II Microkernel Interfaces

coyotos.AddressSpace

Type Definitions

Exceptions

Operations

coyotos.AppNotice

Operations

coyotos.Cap

Type Definitions

Exceptions

Operations

coyotos.CapBits

Structures

Operations

coyotos.CapPage

coyotos.Checkpoint

Exceptions

Operations

coyotos.Discrim

Enumerations

Operations

coyotos.Endpoint

Operations

coyotos.GPT

Constants

Operations

coyotos.IrqCtl

Type Definitions

Operations

coyotos.IrqWait

Operations

coyotos.KernLog

Type Definitions

Operations

coyotos.LocalWindow

Operations

coyotos.Memory

Type Definitions

Enumerations

Operations

coyotos.MemoryHandler

Operations

coyotos.Null

coyotos.ObStore

coyotos.Page

coyotos.Process

Type Definitions

Enumerations

Operations

coyotos.ProcessHandler

Operations

coyotos.Range

Constants

Type Definitions

Exceptions

Enumerations

Operations

coyotos.RcvQueue

Operations

coyotos.SchedCtl

coyotos.Schedule

coyotos.Sleep

Operations

coyotos.SysCtl

Operations

coyotos.Window

Operations

coyotos.coldfire.Process

Structures

Operations

coyotos.i386.Process

Type Definitions

Structures

Operations

III Architecture Specific Annexes

A IA-32 Interface

A.1 Execution Models

A.2 System Call Trap Interface

A.3 Virtual Registers

A.4 Thread Identification

IV Notes on Implementation

9 Implementation of Capabilities

9.1 Unprepared Capabilities

9.2 Prepared Capabilities — Linked Implementation

9.3 Prepared Capabilities — Scavenged Implementation

9.3.1 Scavenging

9.3.2 Pros and Cons

10 Mapping Dependencies

10.1 Page Removal

10.2 GPT Dependencies

10.3 Optimizations


Acknowledgments

Many people have assisted us in evaluating and advancing this design:

Norm Hardy, Charlie Landau, and Bill Frantz of the KeyKOS project. Charlie also runs the CapROS project, another successor to the EROS system.

The members of the coyotos-dev mailing list, notably Bas Wijnen, Tom Bachmann, Christopher Nelson, Dominique Quatravaux, and Pierre Thierry.

The members of the L4 community, notably Hermann Härtig, Espen Skoglund, and Kevin Elphinstone.

The members of the Systems Research Laboratory at Johns Hopkins University, notably Eric Northup, Swaroop Sridhar, and M. Scott Doerrie.

The external participants in the kernel design review meeting of 28-29 March, 2007: Godfrey Vassallo, John Davidsen, Scott Doerrie, and Norman Hardy.

There are surely others that we will come to name as the design stabilizes further, and some that we will inadvertently omit. To the last, please accept our apologies. As is customary, any flaw remaining in this specification is ours.

Comments and suggestions concerning this specification are welcome. They should be sent to the coyotos-dev electronic mailing list. In order to send, you must be subscribed to the list. The subscription interface may be found at:

http://www.coyotos.org/mailman/listinfo/coyotos-dev.

In order to keep the mail archives readable, we ask that you send only ``plain text'' emails.

Preface

Coyotos is a security microkernel. It is a microkernel in the sense that it is a minimal protected platform on which a complete operating system can be constructed. It is a security microkernel in the sense that it is a minimal protected platform on which higher level security policies can be constructed.

The Original Plan

As originally conceived, Coyotos was intended to be a relatively minor departure from its predecessor, EROS [6]. EROS [4] was a small, robust microkernel whose central design ideas were pervasive use of capabilities [11] as the fundamental access model, an atomic, blocking capability invocation (therefore atomic and blocking IPC) model, and a persistent single-level store [5]. All of these features were inherited with some revision from the KeyKOS system [2]. Early application-level work on EROS, notably the defensible network system [10] and the secure window system [8], revealed areas where the EROS architecture would clearly benefit from refinement, but did not initially suggest fundamental shortcomings in the architecture. Coyotos was to have been that minor refinement, incorporating a new IPC primitive called ``endpoints'' and a revised memory mapping entity called a PATT. Our main goals were cleanup, consistency, and formalization.

For algorithmic reasons, the PATT idea did not survive into the current specification, and has been replaced by guarded page tables [3, 1]. Though they were independently invented, guarded page tables may be seen as a generalization of the level skipping techniques of the KeyKOS translation mechanism or the Motorola MC68851 memory management unit [20]. The variant of guarded page tables incorporated here is modified to incorporate the fault handler and background space mechanisms of KeyKOS and EROS.

In January 2004, a summit meeting of sorts occurred between the several research groups working on L4 derivatives and Shapiro. The L4 Dresden group, in particular, wanted to get a better understanding of capability-based design and kernel mechanisms, with the intent that these would be adapted into the L4 architecture [9]. The new kernel architecture would come to be known as ``L4.sec''. There was some discussion of merging the two kernels, but no agreement could be reached on the future of L4's map and unmap operation. While the failure to merge the architectures was a disappointment, the idea that there would be a controlled experiment that would allow us to directly evaluate the map/unmap approach against the EROS node approach was a promising result in its own right.

Overrun by the Hurd

Events intervened in the form of Neal Walfield and Marcus Brinkmann, the current architects of the GNU Hurd system. The Hurd is a protected, object-based operating system that was initially constructed on top of the Mach microkernel. Mach has a variety of problems that have been thoroughly documented in the research literature. Of particular importance to the Hurd are a lack of resource accounting mechanisms and poor performance. As a result of these issues, the Hurd project had provisionally decided to move to L4.

Unfortunately, modeling copyable, protected object references using L4's map/grant operations proved unexpectedly challenging. This left the Hurd project temporarily disrupted, leading Brinkmann and Walfield to seek more information about capability-based design. An extended discussion between Shapiro, Walfield, and Brinkmann at the 2005 Libre Software Meeting about capability systems in general and the plans for Coyotos ensued. As more information about the L4.sec design emerged [19], it became clear that copyable protected references might be problematic on the L4.sec interface as well. Walfield and Brinkmann traveled to Baltimore for a month-long set of design discussions in January 2006, leading to the current design for Coyotos.

In response to those discussions, we flirted for a while with introducing scheduler activations and a new IPC model. It didn't pan out. Initially, we thought that activations might be lighter weight than synchronous IPC. They aren't, and they introduce a lot of complexity in the exception handling model. Through editing errors, you may still find traces of that effort remaining in this document. If so, they are errors, and we would appreciate it if you might bring them to our attention.

Coyotos Today

The version of Coyotos described here has come full circle, and returns to the basic model of the EROS system. The primary differences are the introduction of endpoints, a first-class process object, and GPTs. It also reflects the January 2006 discussions between Walfield, Brinkmann, and Shapiro. As a result of those discussions, the architecture has been challenged a bit harder than it had been. Coyotos retains the atomicity and pure capability-based design of the EROS system.

Jonathan S. Shapiro, Ph.D.
The EROS Group, LLC
August, 2007

Chapter 1: Overview

This document describes the abstractions, objects, and interface specifications (capability types) implemented by the Coyotos microkernel. At some points it includes discussion of the intended model of usage by way of motivating or explaining what has been incorporated. Such discussions are non-normative.

All kernel-implemented objects are named and manipulated by means of capabilities, which grant varying degrees of authority according to the capability type. Developers can extend the system with new objects by deploying processes that implement the associated interfaces. Several such application-implemented objects are part of the core Coyotos system.

1.1 Microkernel Objects

The Coyotos kernel provides processes, GPTs (mapping structures), schedules, receive queues, pages, and a small number of other kernel objects.

Processes    Processes are the unit of execution, scheduling, and resource binding. A process names its address space, its schedule (which governs its execution timing), and its fault handler (which receives notice of exceptions).

Schedules    Schedules are an abstraction of computational resources. In order to execute instructions, a process must name (via a capability) the schedule under which it runs. The schedule, in turn, must convey authority to use one or more processors under a defined scheduling contract.

GPTs    GPTs are the unit of address mapping composition. An address mapping is defined as a mapping from addresses to capability slots, and is represented by a directed (potentially cyclic) graph of GPTs whose leaf capability slots name atomic storage units (pages or capability pages). A virtual address is divided into a virtual page address and a page offset. Valid virtual page addresses describe paths to leaf slots that contain data page or capability page capabilities.

Endpoints    An endpoint is a named rendezvous point between a message sender and a message receiver. Each endpoint carries a receiver-interpreted endpoint identifier. In addition, each endpoint provides means for ensuring that its capabilities can be used exactly once.

Receive Queues    Receive queues provide a means for several processes to receive from a single endpoint. The receive queue acts as a rendezvous point for the receiving processes. When a message is sent via the endpoint, the kernel will select a waiting process from the receive queue and deliver the message to that receiver. This permits kernel demultiplexing of receiving processes, which enhances performance on multiprocessors.

Receive queues remain an experimental idea, and are not implemented by the current kernel.

Pages    Pages are the atomic unit of data and capability storage allocation. An address space consists of a lattice of GPTs whose leaves are pages. Pages are typed: a page may contain either data or capabilities, but not both. The size of a page is determined by the underlying hardware architecture.

There are a small number of other kernel-implemented capabilities. These primarily provide protected transformation operations on capabilities.

1.2 Entry Capabilities and Extensibility

Endpoint objects have ``entry capabilities''. An entry capability does not implement operations on the endpoint. Instead, it provides the means by which an application introduces new services. Any invocation of an entry capability is delivered to the providing object server.

1.3 Checkpointing and Persistence

Coyotos is a persistent object system. Main memory is treated as a cache of a larger backing store. Objects are loaded from backing store on demand and are rewritten to the backing store as a consequence of age or checkpoint. Following a system restart, persistent objects retain their state as of the last checkpoint. A checkpoint saves a ``consistent cut'' of the system. In consequence, processes are recovered in such a way that ongoing communications on the local machine may be resumed without recovery effort.

Secure Restart    On restart, any connection to the outside world is severed if continued communication on that connection might (conservatively) require re-authentication. In particular, network and terminal connections are terminated.

Lost Objects    One risk in this class of design is that objects may be permanently lost as a consequence of low-level storage failures (e.g. sector errors). When backing store is not already duplexed, the Coyotos object store implementation uses software duplexing of critical system structures. Applications may also use this mechanism if desired.

The checkpoint management interfaces used in a driverless kernel are still being refined, and are not yet included in this specification.

1.4 Process States and Exceptions

From the kernel perspective, a process has five run states: blocked, faulted, receiving, ready, and running. A blocked process is waiting for a kernel resource. A receiving process is waiting for an incoming message. A ready process is attempting to execute instructions and is waiting for a CPU. A running process is currently executing. A faulted process is not attempting to initiate instructions.

When a process incurs an exception, the Coyotos kernel synthesizes a message on behalf of the faulted process to a fault handler. It is the responsibility of the fault handler to decide what to do. The kernel does not define a fault handling policy.

1.5 Messages

From the sender perspective, message transmission is (nearly) atomic. From the receiver perspective, message transfer occurs asynchronously. Arrival is signalled by a message completion event delivered to the receiver's activation handler.

Relaxed Data Atomicity    Coyotos permits relaxed data atomicity for stateful messages. While a stateful receive is pending, the data bytes of the receive area are considered undefined and may be modified by the kernel to arbitrary values. When receipt has completed, the receive area is defined up to the kernel-provided length of the received message. The relaxed atomicity rule allows the kernel message send implementation to avoid a pre-probe pass on the received data area, which significantly improves performance. Note that the "undefined" rule explicitly does not apply to received capabilities. The complete set of capabilities (if any) transferred by a message is required to be transferred to receiver-controlled storage atomically. This requirement ensures that the inductive state transition requirements of the formal capability protection model are satisfied.

Blocking Send    A blocking send guarantees eventual delivery provided the operation completes and the receiver is not destroyed before delivery. Page faults at the receiver's designated receive location(s) will be delivered to the receiver-designated fault handlers as required. When fault handling has completed, the sender will retry the send operation from the beginning (see footnote 1). Senders may implement watchdog timeouts on send operations by arranging to post exceptions to themselves after a timed delay.

Non-blocking Send    A non-blocking send will be silently discarded if any condition arises that would cause a blocking send to block. It will be truncated if a receiver page fault occurs during transmission. If truncation occurs, the receiver is notified of the partial delivery.
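The difference between the two send modes can be summarized as a small decision function. The following is a minimal sketch, not kernel code: the enumeration names and the function classify_send are hypothetical, and the two boolean inputs simply stand for "some condition would make a blocking send block" and "a receiver page fault occurs during transmission" as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels; none of these names come from the kernel. */
typedef enum {
    SEND_DELIVERED,  /* whole payload reached the receiver */
    SEND_WAIT,       /* blocking sender waits for the condition to clear */
    SEND_RETRY,      /* blocking sender restarts after fault handling */
    SEND_DISCARDED,  /* non-blocking send is silently dropped */
    SEND_TRUNCATED   /* non-blocking send partially delivered, receiver notified */
} send_outcome;

/* Classify one send attempt according to the rules stated in the text. */
static send_outcome classify_send(bool blocking, bool would_block,
                                  bool receiver_page_fault)
{
    if (would_block)
        return blocking ? SEND_WAIT : SEND_DISCARDED;
    if (receiver_page_fault)
        return blocking ? SEND_RETRY : SEND_TRUNCATED;
    return SEND_DELIVERED;
}
```

For example, classify_send(false, false, true) yields SEND_TRUNCATED, while the same fault under a blocking send yields SEND_RETRY.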

1.6 Naming and Invocation

Coyotos objects are named by capabilities. A capability is a kernel-protected value that names a resource and identifies some interface (equivalently: facet or object) of that resource. The interface in turn defines methods that the invoker can invoke by sending a message specifying the corresponding method code point. Thus, every invocation consists of a message send to a particular method of a particular interface of a particular resource, performed by invoking a capability. This is true both for server-implemented interfaces and kernel-implemented interfaces.

The Coyotos invocation mechanism is derived in part from the EROS design. The invocation payload has been enriched, but the invocation state model has been simplified. An invocation consists of a send phase followed by an optional asynchronous receive phase. The send phase may specify blocking or non-blocking behavior. If a non-blocking send is unable to make immediate progress, its message payload is truncated or dropped. The receive phase, if present, blocks until an incoming message arrives, and can optionally require that the incoming message arrive on a particular endpoint identifier.

The Coyotos kernel implements only one major system call: InvokeCap. A small number of additional system calls exist to implement pseudo-instructions such as capability load and store.

Entry capabilities contain a 32-bit protected payload field. The endpoints that they name contain a 64-bit endpoint identifier. Both values are delivered to the recipient as part of an incoming message. Neither is readable or modifiable by the capability's invoker. Servers may use these values to distinguish interfaces, object identities, permissions, or other desired characteristics.
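As an illustration of how a server might use these two values, the sketch below packs an interface selector and a permission bit into the protected payload. Only the field widths (32 and 64 bits) come from this specification; the structure incoming_msg, its field names, and the bit assignments are hypothetical choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the kernel-supplied part of an incoming message. */
typedef struct {
    uint32_t pp;    /* protected payload of the invoked Entry capability */
    uint64_t epID;  /* endpoint identifier of the endpoint it names */
} incoming_msg;

/* Assumed server-private encoding of the protected payload. */
enum { IFACE_MASK = 0xffu, PERM_WRITE = 0x100u };

/* Interface selector that the server packed into the low payload bits. */
static uint32_t msg_interface(const incoming_msg *m)
{
    assert(m != NULL);
    return m->pp & IFACE_MASK;
}

/* Nonzero if the invoked capability carries the (assumed) write permission. */
static int msg_may_write(const incoming_msg *m)
{
    assert(m != NULL);
    return (m->pp & PERM_WRITE) != 0;
}
```

Because the invoker can neither read nor modify either value, a server following this pattern can hand out read-only and read-write capabilities to the same object simply by fabricating Entry capabilities with different payloads.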

1.7 Exception and Interrupt Handling

For reasons of performance, the Coyotos kernel handles scheduling-related interrupts directly. It does not specify or implement a policy for other interrupt handling. The kernel maintains a capability-named interface for interrupt handler registry. With the exception of low-level scheduling preemption, all policy and processing associated with interrupts is handled by application-level code.

The Coyotos kernel also pushes responsibility for exception handling policy to application level. When runtime application exceptions occur, the kernel delivers the state associated with the exception to an external fault handler designated by an endpoint.

1.8 Protection Model

An essential part of the security microkernel concept is that security policy — including mandatory security policy — should be implemented by application code. The code that enforces system-wide policy needs to be protected and must not be evaded, but it does not necessarily need to run in supervisor mode.

In keeping with this philosophy, the Coyotos kernel does not implement a security policy. Coyotos provides primitive protection support in the form of protected capabilities. Applications can invoke services only by invoking capabilities. Capabilities are kernel protected, and can be obtained only by transfer over capability-authorized channels. It has been shown formally that this restriction is sufficient to support (overt) confinement of subsystems [7], and that given overt confinement, a higher-level security policy can be implemented either by construction or by an application-level reference monitor [18].

A useful property of capability systems is that they directly express the ``relies-on'' relationships between components. If an object or subsystem A depends directly on a second object or subsystem B for its operation, then A necessarily holds a capability to B. In the absence of such a capability, A cannot invoke B at all (or even know of the existence of B). A key point here is that A may rely on B only in a qualified way, and (in some cases) may be able to take measures to guard against failures or hostility from B. This allows applications to take direct responsibility for their dependencies, and also to impose context-sensitive access restrictions on their providers.

For this reason, we try to avoid the term ``trust'' in our designs, preferring instead to use ``relies on.''

Part I: Microkernel Abstractions

Chapter 2: Capabilities

The Coyotos kernel implements a number of object types, each of which has a corresponding capability type:

Encoding  Type              Restrictions  Description
0         Null              -             Universal, invalid capability.
1         Window            RO,NX,WK      A local mapping window (Chapter 4).
2         Background        RO,NX,WK      A background mapping window (Chapter 4).
3         KeyBits           -             Discloses the bit representation of capabilities.
4         Discrim           -             Classifies capabilities.
5         Range             -             Fabricates object capabilities.
6         Sleep             -             Interface to the kernel interval timer.
7         IRQ Control       -             Interrupt request line control interface.
8         Schedule Control  -             Interface to the kernel master scheduling table.
9         Checkpoint        -             Control capability for the kernel checkpoint mechanism.
10        ObStore           -             Interface between kernel and object store manager.
11        Pin Control       -             Permission to pin objects in memory.
12        Schedule          -             Permission to execute under a particular schedule.
13        SysCtl            -             Start, stop system, enter sleep states.
14        KernLog           -             Append text to kernel log.
15        IOPriv            -             Authority to read/write IO ports.
16        IrqWait           -             Authority to wait for an arriving interrupt.
17-31     Reserved          -             Encodings reserved for future use.
32        Endpoint          -             Control capability for an endpoint.
33        Page              RO,NX,WK      Data page. The size of a page is determined by the underlying hardware page size.
34        CapPage           RO,NX,WK      Capability page. The size of a capability page is determined by the underlying hardware page size.
35        GPT               RO,NX,WK,OP   Guarded Page Table. Used to compose larger address spaces from pages.
36        Process           -             Capability that manipulates the kernel process abstraction.
37        AppNotice         -             Capability that permits posting of non-blocking, application-defined software notices.
38-62     Reserved          -             Encodings reserved for future use.
63        Entry             -             Authority to send to the process designated by an Endpoint.

The RO, NX, WK, and OP restrictions respectively indicate read-only, non-executable, weak, and opaque permission restrictions. These are described in detail in the chapter on address spaces.

2.1 Representation

A capability is 16 bytes, and uses the same representation on both 32-bit and 64-bit platforms. The capability structure is a ``tagged union'' whose details depend on the capability type field. The kernel is entitled to use optimized representations internally. The representation given below is the representation disclosed by KeyBits, which is the representation typically used on disk.

Except where otherwise indicated, reserved fields must be zero-filled. The P (prepared) bit and the hz (hazard) bit are kernel internal, and are always zeroed by KeyBits when the capability representation is returned.

2.1.1 Capabilities to Memory Objects

Memory capabilities include page, cappage, GPT, local window capabilities, and background window capabilities. All of these are used to describe portions of the address space. The format of page, cappage, and GPT capabilities is:

Figure 1. Memory object capability

The format of a window capability is:

Figure 2. Mapping window capability

The rootSlot field of the window capability is used only for local window capabilities; this field is reserved in background window capabilities.

Invariant:    l2g ≤ 64
Invariant:    (l2g == 64) ⇒ (guard == 0)    (see footnote 2)
Invariant:    ((guard << l2g) >> l2g) == guard    (see footnote 3)
Invariant:    l2g ≥ log2(page size)
Invariant:    (offset mod 2^l2g) == 0

These invariants are ensured by the operations that fabricate the respective capabilities. The balance of the system is entitled to assume that they hold.
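The invariants can be written as a single predicate. The sketch below is illustrative only: the function name memcap_invariants_hold is hypothetical, and it assumes that guard and offset are held in 64-bit words, which this section does not state. The l2g == 64 case is handled separately because a 64-bit shift by 64 is undefined in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Check the five invariants, assuming 64-bit guard and offset values.
 * l2_page is log2 of the hardware page size (for example, 12 for 4 KiB). */
static bool memcap_invariants_hold(uint64_t guard, unsigned l2g,
                                   uint64_t offset, unsigned l2_page)
{
    assert(l2_page < 64);

    if (l2g > 64)                       /* l2g <= 64 */
        return false;
    if (l2g < l2_page)                  /* l2g >= log2(page size) */
        return false;
    if (l2g == 64)                      /* avoid an undefined 64-bit shift */
        return guard == 0 && offset == 0;
    if (((guard << l2g) >> l2g) != guard)   /* guard survives the shift */
        return false;
    /* offset mod 2^l2g == 0 */
    return (offset & ((UINT64_C(1) << l2g) - 1)) == 0;
}
```

With 4 KiB pages, a guard of 5, an l2g of 12, and an offset of 0x3000 satisfy all five conditions; changing the offset to 0x3001 violates the last one.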

When traversing a memory capability, the virtual address va is defined as the bitwise concatenation of three fields g+u+v, where g is a variable length, possibly empty bit string that will be used as a guard value, u is a variable length, possibly empty bit string that will be used to index into the slots of the named GPT (if any), and v is the virtual address that will remain to be translated at the next step (if any). The length |v| is determined by the l2v field of the named Page, CapPage, or GPT. The capability field l2g contains the length of the bit string |u+v|. The value of the effective guard is a multiple of 2^l2g. For page and capability page capabilities, the value l2g also specifies the target page size. This is possible because neither pages nor capability pages have slots to be indexed.
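The decomposition of va into g, u, and v can be sketched as follows. This is an illustration under stated assumptions, not the kernel's translation code: it assumes a 64-bit virtual address, and the names va_fields, low_bits, and split_va are hypothetical. The only inputs taken from the text are l2g (the length of u+v) and l2v (the length of v).

```c
#include <assert.h>
#include <stdint.h>

/* The three bit strings of the text, each right-aligned in a 64-bit word. */
typedef struct {
    uint64_t g;  /* guard bits: everything above bit l2g */
    uint64_t u;  /* slot index bits: bits l2v .. l2g-1 */
    uint64_t v;  /* remaining address: bits 0 .. l2v-1 */
} va_fields;

/* Keep the low n bits of x; n == 64 keeps everything (no undefined shift). */
static uint64_t low_bits(uint64_t x, unsigned n)
{
    return n >= 64 ? x : x & ((UINT64_C(1) << n) - 1);
}

/* Split va as g+u+v, where |u+v| == l2g and |v| == l2v. */
static va_fields split_va(uint64_t va, unsigned l2g, unsigned l2v)
{
    va_fields f;

    assert(l2v <= l2g && l2g <= 64);
    f.v = low_bits(va, l2v);
    f.u = l2v >= 64 ? 0 : low_bits(va, l2g) >> l2v;
    f.g = l2g >= 64 ? 0 : va >> l2g;
    return f;
}
```

For a page or capability page capability, where there are no slots to index, l2v equals l2g and u is empty, so split_va(va, 12, 12) returns u == 0 and v equal to the 12-bit page offset.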

2.1.2 Message-Related Capabilities

Endpoint capabilities currently do not carry permission bits, but are otherwise similar in layout to memory capabilities. The protected payload field is reserved in the respective control capabilities, and should be zero.

Figure 3. Endpoint capability

Invocations of an endpoint capability ignore the protected payload and provide access to the kernel-implemented object. Invocations of an Entry capability are delivered to an implementing process designated by the endpoint. The protected payload field of the endpoint capability is provided as an additional output of the invocation.

2.1.3 Capabilities to Processes

The format of a process capability is:

Figure 4. Process capability

2.1.4 Miscellaneous Capabilities

The format of a miscellaneous capability is:

Figure 5. Miscellaneous capability

2.2 Valid Capabilities

Wherever this specification refers to a capability of a specific type, it should be taken to mean a valid capability of the stated type. The meaning of an invocation of a valid capability is determined by its implementation provider (kernel or server).

A non-object capability is any capability whose external representation does not include an allocation count. A non-object capability is always valid. Non-object capabilities are not revocable.

All other capabilities are object capabilities. An object capability is valid if and only if all of the following conditions are met:

  • There exists some object with a matching object identifier (OID) whose type is compatible with the type of the capability (see footnote 4).

    This condition may be violated if backing store is lost or corrupted, or through a bug in the object manager.

  • The allocation count in the capability matches the allocation count in the object.

    This condition ceases to be true when an object is revoked (see coyotos.range.rescind).

  • In the case of an endpoint capability whose PM bit is set, the protected payload field of the endpoint capability matches the protected payload field of the endpoint object that it names.

All other object capabilities are invalid. An invalid capability behaves in all observable respects as if it were the Null capability. This applies both to invocation of an invalid capability and to operations that act on invalid capabilities (notably KeyBits, which has implications for debugging invalid capabilities). The kernel is free to overwrite any capability location with a Null capability when it determines that the capability contained in that location is invalid.
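The validity rule can be expressed as a short predicate. The sketch below is a simplified model: the structures cap_t and object_t, their field names, and the 32-bit allocation count width are assumptions made for illustration, and type compatibility is reduced to equality of a type tag. The OID lookup is assumed to have been done by the caller, which passes NULL when no matching object exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified object header. */
typedef struct {
    int      type;         /* object type tag */
    uint32_t alloc_count;  /* width is an assumption */
    uint32_t pp;           /* protected payload; endpoint objects only */
} object_t;

/* Hypothetical, simplified capability. */
typedef struct {
    bool     is_object;    /* false: representation has no allocation count */
    bool     is_endpoint;  /* true for endpoint capabilities */
    bool     pm;           /* PM bit: payload must match */
    int      type;
    uint32_t alloc_count;
    uint32_t pp;
} cap_t;

/* obj is the result of the OID lookup, or NULL if no such object exists. */
static bool cap_is_valid(const cap_t *c, const object_t *obj)
{
    assert(c != NULL);
    if (!c->is_object)
        return true;                       /* non-object: always valid */
    if (obj == NULL || obj->type != c->type)
        return false;                      /* no compatible object */
    if (obj->alloc_count != c->alloc_count)
        return false;                      /* object was rescinded */
    if (c->is_endpoint && c->pm && c->pp != obj->pp)
        return false;                      /* payload mismatch */
    return true;
}
```

A caller that receives false would treat the capability exactly as it treats Null, which matches the rule that invalid capabilities are indistinguishable from the Null capability.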

2.3 Capability Prepare

Coyotos is an object paging system. Both object load and object unload are driven by the use of capabilities. Ignoring latency, this paging behavior is normally invisible to applications. The exception is that object page-in may reveal low-level storage failures that make an object unrecoverable.

Whenever a capability is used, the kernel internally performs a prepare operation on the capability. Conceptually, this prepare step is being done by the process that is performing the current system call. The prepare operation may have several outcomes:

  • If the capability is a non-object capability, the prepare operation succeeds (by definition).

  • If the capability names an object, but its allocCount does not match the allocCount of the corresponding object, the capability is re-written (in place) to the Null capability.

    The containing object is not marked modified. If other operations cause the containing object to be modified, the Null capability will be written to disk. Otherwise, subsequent reloads of the object will re-obtain the stale capability and this check will be performed again with the same result.

    Several optimizations and mechanisms are used to ensure that the disk allocation count does not overflow.

  • If the object named by the capability is not in memory, steps are taken to load it. The preparing process is enqueued to wait for the completion of this request, and re-starts its operation when the object has been loaded. In rare cases, this step may result in an ObjectContentLost exception if the backing store has experienced an unrecoverable storage error.

  • If the object named by the capability is in memory, it is locked for the duration of the current system call unless it is unlocked explicitly.
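The outcomes listed above can be modeled as a small function. This is a sketch under several assumptions: the types, field names, and result codes are hypothetical; the order in which the checks are applied is a choice made for the example (the text lists outcomes, not a sequence); and the ObjectContentLost case is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for one prepare attempt. */
typedef enum { PREP_READY, PREP_NULLED, PREP_WAIT_FOR_LOAD } prep_result;

/* Hypothetical, simplified object header. */
typedef struct {
    bool     in_memory;
    bool     locked;
    uint32_t alloc_count;
} object_t;

/* Hypothetical, simplified capability. */
typedef struct {
    bool      is_object;
    uint32_t  alloc_count;
    object_t *target;
} cap_t;

static prep_result cap_prepare(cap_t *c)
{
    assert(c != NULL);
    if (!c->is_object)
        return PREP_READY;             /* non-object: succeeds by definition */
    assert(c->target != NULL);
    if (!c->target->in_memory)
        return PREP_WAIT_FOR_LOAD;     /* caller waits, then restarts */
    if (c->target->alloc_count != c->alloc_count) {
        c->is_object = false;          /* rewrite in place to Null */
        c->alloc_count = 0;
        c->target = NULL;
        return PREP_NULLED;
    }
    c->target->locked = true;          /* held for the current system call */
    return PREP_READY;
}
```

A process that receives PREP_WAIT_FOR_LOAD would be enqueued until the object arrives and would then restart its operation, calling cap_prepare again.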

A capability is ``used'' if:

  • The capability is invoked by the current system call.

  • The capability designates the invokee of the current system call.

  • Fetching a capability argument or storing a capability result requires memory traversal, in which case all capabilities in the traversed slots are used.

  • The operation requested by the current system call accesses or mutates the target object of the capability.

Coyotos implementations are required to be atomic. This implies that all resource acquisitions (and therefore all capability prepares) must be completed before any observable side-effect of a system call occurs.
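One common way to meet this kind of requirement is a two-phase structure: acquire everything first, then commit. The sketch below illustrates that general pattern only; it is not taken from the Coyotos implementation, and the names resource_t, release_all, and syscall_atomic are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for anything a system call must acquire. */
typedef struct {
    bool ready;  /* can be acquired right now */
    bool held;   /* currently held by this system call */
} resource_t;

static void release_all(resource_t *res, size_t n)
{
    for (size_t i = 0; i < n; i++)
        res[i].held = false;
}

/* Returns true if the call committed, false if it must be restarted.
 * *state is modified only after every resource has been acquired. */
static bool syscall_atomic(resource_t *res, size_t n, int *state, int new_state)
{
    assert(res != NULL && state != NULL);

    /* Phase 1: acquire everything; nothing observable changes yet. */
    for (size_t i = 0; i < n; i++) {
        if (!res[i].ready) {
            release_all(res, n);
            return false;
        }
        res[i].held = true;
    }

    /* Phase 2: commit point; the first observable side effect. */
    *state = new_state;
    release_all(res, n);
    return true;
}
```

If any acquisition fails, the caller sees the state exactly as it was before the call, so the operation can simply be restarted later.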

2.4 Extensibility

Coyotos is an extensible object system in the sense of Hydra [16]. New objects may be introduced by designing a process that implements the desired object. Capabilities to these objects are implemented as Entry capabilities. The kernel checks these capabilities for validity, and optionally for a protected payload match (see Chapter 5), but does not otherwise define semantics for these capabilities.

Because the kernel does not know the semantics of these extensions, entry capabilities are not considered ``safe'' by the coyotos.discrim.classify operation.

Chapter 3: Processes

A Coyotos process provides an abstraction of the user-mode execution engine presented by the underlying microprocessor. From the kernel perspective, a process is the unit that is dispatched by the kernel for execution.

Coyotos does not distinguish between processes and threads. A process encapsulates a single kernel thread of execution. Coyotos address spaces are first-class objects. Two (or more) processes may be constructed that designate the same address space. This achieves concurrent execution of multiple kernel threads of control within a common addressing environment and resource pool.

Coyotos implements the system calls described in Chapter 6. Most of these should be viewed as software-defined instructions. The exception is the InvokeCap system call, which performs capability invocation (see Chapter 5). The majority of the kernel's function is provided in the form of kernel-implemented objects (equivalently: services) that are named by capabilities. These services are invoked in the usual way by invoking their capabilities.

3.1 State of a Process

The state of a process may be divided conceptually into kernel (privileged or sensitive) state and user (non-privileged) state. User state is that state which a process may modify directly without kernel intervention. This includes architecture-defined non-privileged register state. It also includes additional ``pseudo registers'' defined by Coyotos that support the capability invocation mechanism.

Kernel state is that state which records or discloses protection information, or for which the kernel must guarantee invariants for reasons of security, robustness, or operational consistency. The representation of capabilities, for example, is kernel state. The Coyotos process structure contains space to save both the kernel state and the user state of a process.

On some hardware architectures, the separation between kernel state and user state is not cleanly accomplished by the architecture. The most common examples of this involve design failures in the architected processor status word. The IA-32 eflags register, for example, includes state such as the supervisor mode bit and the current ``IO privilege level.'' Such fields present a problem because the balance of the eflags register must be modifiable by untrusted code. When an unprivileged application runs normal instructions, the hardware generally protects these bits from modification. On such architectures, Coyotos must ensure that any registers modified by get/set registers and similar operations properly protect these fields. The architecture-specific annex for each architecture identifies any such registers and their update constraints.

3.1.1 Per-process State

Figure 6.

Per-Process kernel state

Each process has the following state:

  • The process run state. This field indicates whether the process is running (0), receiving (1), or faulted (2).

  • The process flags word. This word contains the process run state and several bits that control fault-related and debugging-related behavior.

  • A software-defined notices bitfield, notices, indicating (by bit position) the software-defined notices that are pending for this process.

  • 32 capability ``registers'' that are implemented in software by the Coyotos kernel.

  • Capability slots that support process recognition and identification: the brand and the cohort.

  • Capability slots that identify resources on which the process depends: the address space, the schedule, and the external fault handler.

    Coyotos address spaces are ``first class''. An address space may exist without having any associated process. Multiple processes may name the same address space by placing the same address space capability in their respective address space slots. Schedules are similarly ``first class.''

  • Slots related to exception handling.

    The faultCode and faultInfo are conceptually similar to the underlying hardware processor's exception registers. A process-incurred exception causes these registers to be updated with the information necessary for error diagnosis and possible resolution. The exception fault code space unifies both hardware-defined and kernel-defined exceptions into a single code point space.

    The handler capability slot contains an entry capability to the external fault handler (if any). This is an external process that should be notified whenever this process incurs a fault.

  • Storage for the architecture-defined non-privileged register set. Access to these registers is by means of invocations on the architecture-specific process capability.

The per-process capability state, fault code, and fault information can be accessed and manipulated only through invocations of the process capability.
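The state listed above might be pictured roughly as the following structure. This is only an illustrative sketch: the type and field names (cap_t, proc_state_t, capReg, proc_init, and so on), the field widths, and the field order are assumptions made for the example, not the kernel's definitions. The run state is shown as a separate field for readability even though the flags word is described as containing it, and the architecture-defined register save area is omitted. The 16-byte capability size is the one given in Section 4.2, and FC_NoFault (0) is the value given in Section 3.2.

```c
#include <stdint.h>
#include <string.h>

/* Opaque 16-byte capability; its representation is kernel state. */
typedef struct { uint8_t bytes[16]; } cap_t;

/* Run state encodings as listed in the text. */
typedef enum { RS_RUNNING = 0, RS_RECEIVING = 1, RS_FAULTED = 2 } run_state_t;

#define FC_NoFault  0u
#define N_CAP_REGS  32

/* Hypothetical layout: names, widths, and ordering are assumptions. */
typedef struct {
    run_state_t runState;         /* running, receiving, or faulted */
    uint32_t    flags;            /* process flags word */
    uint32_t    notices;          /* pending software-defined notices */
    uint32_t    faultCode;        /* unified hardware/kernel fault code */
    uint64_t    faultInfo;        /* additional fault information */
    cap_t       brand, cohort;    /* recognition and identification */
    cap_t       addrSpace;        /* address space capability */
    cap_t       schedule;         /* schedule capability */
    cap_t       handler;          /* external fault handler (Entry) */
    cap_t       capReg[N_CAP_REGS]; /* software capability registers */
} proc_state_t;

/* Put a process structure into a quiescent, fault-free state.
 * Treating an all-zero capability as Null is an assumption of this sketch. */
static void proc_init(proc_state_t *p)
{
    memset(p, 0, sizeof *p);
    p->runState  = RS_RUNNING;
    p->faultCode = FC_NoFault;
}
```

Because address spaces and schedules are first class, two such structures may carry the same capability in their addrSpace or schedule slots.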

The process flags are shown in Figure 7. The fields have the following meanings:

Figure 7.

Process flags word

Field Meaning
xm

Execution Model    indicates whether this process uses a 32-bit (0) or 64-bit (1) execution model. This bit is significant mainly on architectures having multiple execution models, such as amd64. It controls certain aspects of cross-model invocations.

sx

Slice Expired    This bit is set by the kernel when the process's real-time slice has expired. The slice expired event is considered an application-defined interrupt. Use and delivery of the sx notification is discussed in Chapter 7.

sn

Soft Notice    This bit is set by the kernel whenever a new bit is set in the pending application-defined notices field.

tc

Trap On Call    indicates that the process should incur a ``trap on syscall'' exception when it attempts to perform a system call. This trap will occur after registers have been saved to the process structure, but before arguments have been examined by the kernel. In particular, the system call number will not yet have been examined by the kernel.

tr

Trap On Return    indicates that the process should incur a ``trap on system call return'' exception when it exits or bypasses the receive state following a successful invocation. This trap occurs just after the parameter words (if any) have been copied out to the application. In consequence, it occurs after any associated exceptions are processed by the recipient. Control has not been returned to the receiver.

If this bit is set at process system call return, a process.FC_SysCallReturn fault will be set in the process state just prior to returning. The process.resume() operation does not cause the invokee to resume in the system call exit path, so this exception will not re-occur on resumption.

cs

Call Step    if set (1), indicates that the tc bit should be ignored at the next point where it would normally take effect.

This bit is set as a side effect of the process.resume() operation if the currently pending fault code is process.FC_SysCallEntry. It is cleared whenever the process proceeds successfully to the commit point of the current system call.

pc

Parameter Copyout    This bit is set by the kernel whenever a parameter copyout from the parameter scratchpad area is required before resuming user-mode execution of the current process.
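The interaction between the tc and cs bits can be sketched as follows. The bit positions used here are placeholders chosen for the example (the actual layout is the one shown in the process flags figure), and the function names are hypothetical. The sketch only models the rules stated above: cs suppresses one trap-on-call, is set by process.resume() when the pending fault is process.FC_SysCallEntry, and is cleared at the commit point of the system call.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions; the real layout is not assumed here. */
#define PF_TC 0x1u   /* Trap On Call */
#define PF_CS 0x2u   /* Call Step    */

/* Would a system call entry raise the trap-on-syscall exception? */
static bool syscall_entry_traps(uint32_t flags)
{
    return (flags & PF_TC) != 0 && (flags & PF_CS) == 0;
}

/* Effect of process.resume() on the flags word. */
static uint32_t flags_after_resume(uint32_t flags, bool pending_is_syscall_entry)
{
    return pending_is_syscall_entry ? (flags | PF_CS) : flags;
}

/* Effect of reaching the commit point of the current system call. */
static uint32_t flags_at_commit_point(uint32_t flags)
{
    return flags & ~PF_CS;
}
```

A debugger that sets tc therefore sees one trap per system call: the resumed call proceeds past the entry trap because cs is set, and the next call traps again because cs was cleared at the commit point.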

3.2 Execution Model

The instruction set available to a Coyotos process consists of the user mode (non-privileged) instruction set of the underlying processor architecture and the kernel-implemented InvokeCap instruction (which is the subject of Chapter 5).

From the perspective of the kernel, a process exists in one of the following run states:

blocked

Process is attempting to send, but is blocked awaiting availability of a kernel resource. Process has no current or pending software interrupts. On release, process will resume in the ready state.

receiving

Process is waiting for an incoming message from an endpoint, and has no current or pending software interrupts. On receipt, process will resume in the ready state.

ready

Process is attempting to execute instructions. Process may have current or pending software interrupts. Pending exceptions will be delivered when the process transitions to the running state.

running

Process is assigned to a CPU and is executing instructions.

faulted

Process has incurred an exception that has been reported to the external fault handler designated by the process's handler capability. Process is not executing instructions.

The state transition diagram is shown in Figure 8.

Figure 8.

Process state transitions

The blocked state is not externally observable. A blocked process has an externally reported runState of ``running''. Such a process is deemed to be running without making progress or consuming CPU cycles. The receiving state is also not externally observable. A receiving process is executing the receive phase of a capability invocation very slowly.

A process that is in the running state will initiate instructions as long as its process faultCode field is set to FC_NoFault (0). Execution behavior when any other value is stored in the faultCode field is discussed in Section 3.3.

3.3 Exception Handling

An exception occurs as the result of an instruction executed by the process. Every exception has an associated fault code. Specific exceptions may define an additional pointer value to be delivered as additional fault information. The fault code and fault information are delivered to the process by storing them in the faultCode and faultInfo fields of the Process and causing the process to resume execution.

3.3.1 Exception Delivery

When a process attempts to initiate instructions with a faultCode other than FC_NoFault, the behavior is as follows:

  1. If an Entry capability is stored in the process's handler slot, the kernel synthesizes a message to this endpoint on behalf of the faulting process. The message will provide the faultCode and faultInfo values and a process capability to the faulted process. Disposition of the faulted process is now at the discretion of the fault handler.

    If the handler process is blocked, handler message delivery will be re-attempted when the external handler process becomes unblocked.

  2. The process enters the faulted state (runState = faulted) and ceases to execute instructions.

In the absence of a specified external handler, a process attempting to deliver a fault notification to its external handler will effectively cease to execute instructions without notice to anyone. It is the responsibility of the programmer to ensure that an external handler capability is defined if noticing this condition is required.

Note that the state of the per-process handler slot is checked on each delivery attempt. If a process blocks attempting to deliver its fault information to an external fault handler, and the handler slot is modified before the external handler becomes unblocked, the fault may end up being delivered to a different handler or to no handler at all depending on the new value of the handler slot.
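The delivery rules above can be summarized by the following sketch of a single delivery attempt. All names are hypothetical, the capability in the handler slot is reduced to its type, and message transmission is not modelled; the handler_blocked parameter stands in for whatever condition prevents the handler from receiving. The point of the sketch is that the handler slot is consulted afresh on every attempt.

```c
#include <stdbool.h>
#include <stdint.h>

#define FC_NoFault 0u

typedef enum { CAP_NULL, CAP_ENTRY, CAP_OTHER } cap_type_t;  /* simplified */
typedef enum { RS_RUNNING = 0, RS_RECEIVING = 1, RS_FAULTED = 2 } run_state_t;

typedef struct {
    run_state_t runState;
    uint32_t    faultCode;
    uint64_t    faultInfo;
    cap_type_t  handlerType;   /* type of the capability in the handler slot */
} proc_t;

typedef enum {
    DELIVER_NONE,    /* no fault pending; process keeps executing */
    DELIVER_RETRY,   /* handler blocked; attempt again later */
    DELIVER_SENT,    /* fault message sent; process is now faulted */
    DELIVER_SILENT   /* no handler; process faulted without notice */
} outcome_t;

/* One attempt to deliver the pending fault of p. */
static outcome_t try_deliver_fault(proc_t *p, bool handler_blocked)
{
    if (p->faultCode == FC_NoFault)
        return DELIVER_NONE;

    /* The handler slot is re-read on each attempt, so a change made
     * while the process was blocked takes effect here. */
    if (p->handlerType == CAP_ENTRY) {
        if (handler_blocked)
            return DELIVER_RETRY;
        p->runState = RS_FAULTED;
        return DELIVER_SENT;
    }

    p->runState = RS_FAULTED;
    return DELIVER_SILENT;
}
```

If the handler slot is overwritten with a Null capability between a DELIVER_RETRY result and the next attempt, the second attempt yields DELIVER_SILENT, which is the "no handler at all" case described above.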

3.3.2 Return From Out-of-Process Handler

If an exception has been delivered to a handler, the handler must take action to clear the fault. It does this by invoking the process capability provided by the kernel upcall to clear the fault and return the process to the running state.

3.4 Application-Defined Notifications

Coyotos supports application-defined non-blocking notifications via the AppNotice capability type. A notification is posted by invoking the AppNotice capability with a 32-bit mask indicating the notifications (in the range 0..31) to be posted. The set of authorized notifications is determined at the time the AppNotice capability is fabricated. Of the set of notices posted, only the authorized subset is delivered. The effect of posting a set of application-defined notices is to set the corresponding bits of the target process notices word, and to set the sn bit of the target process flags if the value of notices has changed as a consequence of this posting (i.e. if at least one posted notice was not already pending).

If any notices are pending when the recipient enters an open wait, they will be delivered as a message, with a specified endpoint ID of ~0ull and a protected payload value of zero. Delivery of pending notices has higher priority than other incoming messages.

Delivery of application-defined notices is suppressed during a closed wait.
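The posting rule can be sketched as a few bit operations. The structure and function names below are hypothetical, and the sn bit is modelled as a separate boolean rather than as a bit of the flags word; delivery during an open wait is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two pieces of target process state touched by a posting. */
typedef struct {
    uint32_t notices;  /* pending application-defined notices */
    bool     sn;       /* Soft Notice bit of the process flags */
} notice_target_t;

/* Post the notices in mask through an AppNotice capability that was
 * fabricated with the given authorized set. Returns the bits that
 * became pending as a result of this posting. */
static uint32_t post_notices(notice_target_t *t, uint32_t authorized, uint32_t mask)
{
    uint32_t delivered = mask & authorized;       /* unauthorized bits are dropped */
    uint32_t fresh     = delivered & ~t->notices; /* not already pending */

    t->notices |= delivered;
    if (fresh != 0)
        t->sn = true;   /* set only when notices actually changed */
    return fresh;
}
```

Posting a notice that is already pending leaves both notices and sn unchanged, so repeated postings coalesce.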

Chapter 4: Address Spaces

The Coyotos architecture defines 64-bit address spaces for both 32-bit and 64-bit machines. On 32-bit machines, the leading 2^32 byte positions are addressable by hardware load and store instructions. That is, the hardware-accessible map is a window onto the leading subrange of the software-defined space.

On some architectures, a portion of the hardware-addressable space may be reserved for use by the kernel. On such machines, the hardware-accessible address space is overlaid by the kernel-defined region.

4.1 Memory Objects and Address Interpretation

Three objects are used to define Coyotos address spaces: pages, cappages, and GPTs. Capabilities to these objects may be invoked in the usual way. The interface definitions for these objects are provided in Part II.

The meaning of a data (capability) address reference is determined by starting at the data (capability) address space capability of the referencing process and traversing memory objects until the address has been successfully translated or an exception has occurred. The traversal process is similar to the traversal of hardware-based hierarchical translation tables, but there are several differences:

  • The Coyotos mapping structures provide support for per-region fault handlers. Any region of size 2^k pages may have an associated fault handler. When a memory fault is reported to the in-process fault handler, the in-process handler may optionally forward memory fault messages to the per-region handler in order to request region-specific fault handling.

  • The ``levels'' of the mapping hierarchy are dynamically determined. Smaller subspaces may appear where a larger space is expected, with the effect that the ``missing'' regions are considered invalid addresses. Larger subspaces may appear where a smaller space would naturally appear, with the effect that only the leading subrange of the larger subspace is addressable through this mapping.

  • A mechanism is provided for mapping ``windows'' onto other address spaces by reference. This enables one address space to map (portions of) another even when the second space is opaque.

  • In order to support certain essential types of addressing flexibility — notably windows — it is necessary to allow some unusual arrangements of the hierarchical structures. An unfortunate consequence of this is that it is possible for a hostile or erroneous program to create statically cyclic address spaces. Such spaces are malformed, and attempts to reference a cyclically defined address generate a MalformedSpace exception.

4.1.1 Permissions

All memory object capabilities carry a four-bit field, restr, which specifies restrictions on which types of access may legally be performed:

RO

Read Only (0x1)    Attempts to perform write references along any address translation path that traverses this capability are prohibited.

NX

No Execute (0x2)    Instruction fetch references along any address translation path that traverses this capability are prohibited. Attempts to perform instruction fetches at such addresses generate a NoExecute exception.

Issue

I have not yet examined the exception handling policy for machines that implement NX to confirm that a differentiated access violation type is generated at the hardware level.

On hardware that does not support the NX restriction, the NX bit is ignored.

WK

Weak (0x4)    A capability read reference along an address translation path that traverses a capability with this bit set conservatively downgrades the returned capability, if required, in a way that ensures transitively read-only authority.

Capability and data stores that traverse a weak capability in the translation path generate an access violation exception.

OP

Opaque (0x8)    The address space structure may not be accessed or modified through any GPT capability with the Opaque bit set.

The result of translation of the form translate(space,addr,access-type) is either an exception or a valid translation of the form (page,offset). If an exception is generated, the type of the exception and the originally referenced address are reported to a handler (if one is defined), the faulting instruction (if any) has no effect, and the program counter is not advanced. The defined reference types are:

Fetch

Instruction load from address space.

Load Data

Read data from address space.

Load Capability

Read capability from address space.

Store Data

Write data to address space.

Store Capability

Write capability to address space.

4.1.2 References and Access Violations

The rules for address translation are given below in the discussions of individual memory objects. As traversal of the memory objects proceeds, the effective restrictions associated with the address are computed by beginning with no restrictions and performing a cumulative logical OR with the restriction bits of each traversed capability as translation progresses.

If a capability is traversed during translation that cannot legally appear within an address space, a MalformedSpace exception is generated according to the reference type.

If the traversed path is well-formed, but the address cannot be completely translated, an InvalidAddress exception is generated according to the reference type. Untranslatable fetch references generate the InvalidAddress exception.

If an address is completely translatable, the resulting permission restrictions may not permit the reference type. In this case an exception will be generated according to the following rules:

Ref Type                        Permissions    Result
Fetch                           NX             Exception: NoExecute
Capability Store, Data Store    RO or WK       Exception: AccessViolation

If the permissions are sufficient to allow the operation, a final check is made to ensure that the type of the load or store operation (data or capability) matches the type of the page mapped at that address (Page, CapPage). If a type mismatch occurs, a DataAccessTypeError or CapAccessTypeError exception is raised.
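The accumulation of restrictions and the permission rules above can be sketched as follows. The restriction bit values are those given in Section 4.1.1; the type and function names are hypothetical. The final page-type check (data versus capability page) is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Restriction bits from Section 4.1.1. */
#define RESTR_RO 0x1u  /* Read Only  */
#define RESTR_NX 0x2u  /* No Execute */
#define RESTR_WK 0x4u  /* Weak       */
#define RESTR_OP 0x8u  /* Opaque     */

typedef enum {
    REF_FETCH, REF_LOAD_DATA, REF_LOAD_CAP, REF_STORE_DATA, REF_STORE_CAP
} ref_type_t;

typedef enum { EX_NONE, EX_NO_EXECUTE, EX_ACCESS_VIOLATION } access_result_t;

/* OR together the restr fields of every capability on the traversed path. */
static uint8_t accumulate_restr(const uint8_t *path, size_t n)
{
    uint8_t ar = 0;                    /* start with no restrictions */
    for (size_t i = 0; i < n; i++)
        ar = (uint8_t)(ar | path[i]);
    return ar;
}

/* Apply the permission rules to a completely translated address. */
static access_result_t check_access(uint8_t ar, ref_type_t ref)
{
    switch (ref) {
    case REF_FETCH:
        return (ar & RESTR_NX) ? EX_NO_EXECUTE : EX_NONE;
    case REF_STORE_DATA:
    case REF_STORE_CAP:
        return (ar & (RESTR_RO | RESTR_WK)) ? EX_ACCESS_VIOLATION : EX_NONE;
    default:
        return EX_NONE;                /* loads are not blocked by these rules */
    }
}
```

Because the restrictions are only ever ORed together, a restriction introduced anywhere along the path cannot be removed by a capability deeper in the tree.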

A capability load that traverses a path having WK restrictions will succeed, but will return a downgraded result as follows:

Capability At Address                   Result
Page, CapPage, GPT, Window, Endpoint    Copy with RO, WK bits set.
Discrim                                 Return value is unchanged.
other                                   Null capability is returned.
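
The downgrade table can be read as a small function on the loaded capability. In this sketch a capability is reduced to a type tag and its restr field; the type names and the weaken function are hypothetical, and only the behaviour stated in the table is modelled.

```c
#include <stdint.h>

#define RESTR_RO 0x1u
#define RESTR_WK 0x4u

typedef enum {
    CT_NULL, CT_PAGE, CT_CAPPAGE, CT_GPT, CT_WINDOW,
    CT_ENDPOINT, CT_DISCRIM, CT_OTHER
} cap_type_t;

/* Simplified capability: only the fields the downgrade rule touches. */
typedef struct {
    cap_type_t type;
    uint8_t    restr;
} cap_t;

/* Result of a capability load through a path whose restrictions include WK. */
static cap_t weaken(cap_t c)
{
    switch (c.type) {
    case CT_PAGE:
    case CT_CAPPAGE:
    case CT_GPT:
    case CT_WINDOW:
    case CT_ENDPOINT:
        c.restr = (uint8_t)(c.restr | RESTR_RO | RESTR_WK);
        return c;               /* copy with RO and WK set */
    case CT_DISCRIM:
        return c;               /* returned unchanged */
    default: {
        cap_t null_cap = { CT_NULL, 0 };
        return null_cap;        /* everything else becomes Null */
    }
    }
}
```

Since the returned copy itself carries WK, any capability later loaded through it is downgraded in turn, which is what makes the read-only authority transitive.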

4.2 Pages and Capability Pages

The smallest mappable unit, and therefore the smallest address space, is the page or the cappage. A page is the atomic unit of data storage whose size is implementation-defined. A capability page is a page-sized unit that holds capabilities rather than data. Capabilities are byte-addressed opaque 16-byte quantities that are aligned at 16 byte boundaries.

Coyotos implements a single page size whose size matches some hardware page size implemented by the underlying hardware. On processors that implement multiple page sizes, the selected page size need not be the smallest size supported by the underlying hardware. It is implementation-dependent whether the kernel will attempt to exploit larger hardware page sizes if available. If such exploitation is attempted, it is accomplished by re-synthesizing larger pages by physical arrangement of standard-sized pages. The atomic unit of mapping and permissions remains the Coyotos page size.

A page capability may be inserted into the address space slot of a process, with the effect of defining an address space having valid offsets between [guard,guard+pgsize-1]. Attempts to reference offsets outside this range result in an invalid address exception.

Capability pages are byte-addressable units. However, capabilities must be stored and referenced at naturally aligned (16 byte) boundaries.

Address translation of an address addr with respect to a page or cappage capability is defined as follows:

  1. If the value of addr equals or exceeds the page size, an InvalidAddress exception is generated.

  2. Otherwise: the addr is a valid offset, and the overall address reference is valid.

4.3 Address Space Composition

Address spaces are composed by means of the GPT object. A GPT is simply a fixed-length vector of capabilities (currently 16), each of which is paired with a guard. In Coyotos, the guard has been incorporated into the capability format itself. The state of a GPT is shown below.

Figure 9.

GPT State

Invariant:    l2v ≥ log2(page size)

The meanings of the GPT fields are:

l2v

Subspace size    Each slot of the GPT names a subspace of size 2^l2v bytes.

ha

fault handler    Slot 15 of the GPT contains an Entry capability to the fault handler.

Care should be taken to set the l2v value appropriately when the ha bit is set. If the translation algorithm traverses an Entry capability in the normal course of translation, a malformed space exception will be generated.

bg

background space    Slot 14 of the GPT contains a memory capability to a background space (see window capabilities).

Care should be taken to set the l2v value appropriately when the bg bit is set. If the translation algorithm traverses a background capability during the normal course of translation, the translation result will appear as if a larger space was entered.

cap[0..15]

Capabilities to subspaces.

When the ha or bg bits are set, it is the responsibility of the process managing the GPT to ensure that the l2v value prevents collision.

4.3.1 Translation Algorithm

Note:    In the discussion that follows, it may be useful to refer to the capability representation for window and memory object capabilities (see Chapter 2), with particular reference to the l2v and l2g fields.

Address translation is performed by translating an unsigned virtual address va with respect to some memory capability C (a GPT, page, capability page, or window capability). Translation begins at the address space capability of the process structure with a 64-bit virtual address. In the normal case, the progress of translation causes bits to be ``consumed'' from the left, leading to virtual addresses of progressively smaller magnitudes. Window capabilities, however, may cause the remaining virtual address to grow as translation proceeds.

The virtual address va that is currently being translated is conceptually divided into three fields g, u, and v. The g field (which may be zero width) contains the guard value. The u field contains the index value that will be used to index into the next GPT. The v field contains either the address bits that will remain to be translated when the current step has completed (GPT or window capability) or the page offset bits (page or capability page capability).

Figure 10.

Virtual address structure

In reading the following section, recall the invariants described in Section 2.1.1. These are checked at capability fabrication time, and are assumed to hold by the following algorithm statements.

The values of g, u, and v are computed from the capability C and the address va as follows:

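g = va >> C.l2g;                  /* guard bits of the address */
guard = C.guard << C.l2g;         /* capability guard, aligned with va */
u = (va - guard) >> C.l2v;        /* index into the next GPT */
v = va & ((1ull << C.l2v) - 1);   /* bits remaining after this step */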
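
As a worked illustration of these formulas, the following sketch computes the three fields for a 64-bit address. The structure and function names are hypothetical, and the sketch assumes that l2g and l2v are both less than 64 so that the shifts are well defined. The example values assume a capability to a 16-slot GPT whose slots each span 2^12 bytes (l2v = 12, l2g = 16) and whose guard is 3.

```c
#include <stdint.h>

/* The capability fields that participate in the split. */
typedef struct {
    uint8_t  l2g;    /* position of the guard field */
    uint8_t  l2v;    /* log2 size of the subspace named by each slot */
    uint64_t guard;  /* guard value stored in the capability */
} memcap_fields_t;

typedef struct { uint64_t g, u, v; } va_fields_t;

/* Split va into guard, index, and remainder with respect to c.
 * If g does not match c.guard the address is rejected, so the value
 * of u is only meaningful when the guard matches. */
static va_fields_t split_va(uint64_t va, memcap_fields_t c)
{
    va_fields_t f;
    uint64_t guard = c.guard << c.l2g;

    f.g = va >> c.l2g;
    f.u = (va - guard) >> c.l2v;
    f.v = va & ((UINT64_C(1) << c.l2v) - 1);
    return f;
}
```

With the example capability and va = 0x35abc, the fields are g = 0x3 (matching the guard), u = 0x5 (slot 5 of the GPT), and v = 0xabc (the address passed to the next step).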

At the start of translation, the background space capability Cbackground and the memory handler capability Chandler are initialized to the Null capability, the virtual address is as provided by the hardware (or possibly the IPC logic), and the effective access restrictions AR are the empty set. Translation proceeds by iteration, with each iteration performing the following steps in sequence:

  1. The g value is compared to the zero-extended guard value stored in the capability. If they do not match, the address is invalid and an InvalidAddress exception is generated.

  2. If the u value equals or exceeds the number of slots in the GPT, the address is invalid and an InvalidAddress exception is generated.

  3. The effective access restrictions are updated from the capability C by:

    AR:=AR+C.restr

    If the resulting effective access restrictions do not permit the requested access type, an AccessViolation exception is generated.

  4. Processing now proceeds according to the capability type:

    • If the capability type C.type is Page or CapPage, translation has completed successfully.

    • If a local or background window capability appears in the address space slot of a process, all addresses are deemed invalid.

    • If the capability is a local window capability appearing within some GPT, translation proceeds from the capability contained in the rootSlot slot of the GPT containing the local window capability at the offset named by the capability.

      va := v + C.offset
      C := containingGPT[C.rootSlot]

      Note that the invariants of Section 2.1.1 guarantee that there is no bitwise overlap between v and C.offset. That is: the addition can be correctly implemented as a bitwise ``or'' operation.

    • If the capability is a background window capability, translation proceeds from the capability to the background space with

      va := v + C.offset
      C := Cbackground

      Note that the invariants of Section 2.1.1 guarantee that there is no bitwise overlap between v and C.offset. That is: the addition can be correctly implemented as a bitwise ``or'' operation.

      Recall that Cbackground is initialized to Null at the start of translation. If no other background capability has been defined at the point where the background window capability is encountered, all addresses that fall within the background window are invalid.

    • If the capability type is GPT, translation proceeds with

      gpt := target-of(C)
      if (gpt->bg)
        Cbackground := gpt->cap[14]
      if (gpt->ha)
        Chandler := gpt->cap[15]
      va := v
      C := gpt->cap[u]

    • If the capability is a Null capability, an InvalidAddress exception is generated.

    • Otherwise, a MalformedSpace exception is generated.
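
A reduced form of the iteration can be sketched as follows. This is a simplification under stated assumptions, not the full algorithm: window capabilities, the background space, the fault handler slot, and permission checks are all omitted, the capability type is tested before the guard so that Null and illegal capabilities are classified first, and the loop is given a crude step limit (MAX_STEPS, an arbitrary value) purely so that the sketch terminates on cyclic input. All type and field names are hypothetical.

```c
#include <stdint.h>

#define GPT_SLOTS 16
#define MAX_STEPS 64   /* arbitrary guard against cycles, for this sketch only */

typedef enum { CT_NULL, CT_PAGE, CT_CAPPAGE, CT_GPT, CT_ENTRY } cap_type_t;
typedef enum { TR_OK, TR_INVALID_ADDRESS, TR_MALFORMED_SPACE } tr_status_t;

struct gpt;

typedef struct {
    cap_type_t  type;
    uint8_t     l2g, l2v, restr;
    uint64_t    guard;
    struct gpt *gpt;      /* target object when type == CT_GPT */
    int         page_id;  /* stand-in for the page when type is a page */
} cap_t;

struct gpt { cap_t cap[GPT_SLOTS]; };

typedef struct {
    tr_status_t status;
    int         page_id;
    uint64_t    offset;
    uint8_t     restr;    /* accumulated restrictions */
} tr_result_t;

static tr_result_t translate(const cap_t *root, uint64_t va)
{
    tr_result_t r = { TR_OK, -1, 0, 0 };
    const cap_t *c = root;

    for (int step = 0; step < MAX_STEPS; step++) {
        if (c->type == CT_NULL) {
            r.status = TR_INVALID_ADDRESS;
            return r;
        }
        if (c->type != CT_PAGE && c->type != CT_CAPPAGE && c->type != CT_GPT) {
            r.status = TR_MALFORMED_SPACE;   /* e.g. an Entry capability */
            return r;
        }

        /* Step 1: guard comparison. */
        if ((va >> c->l2g) != c->guard) {
            r.status = TR_INVALID_ADDRESS;
            return r;
        }
        uint64_t u = (va - (c->guard << c->l2g)) >> c->l2v;
        uint64_t v = va & ((UINT64_C(1) << c->l2v) - 1);

        /* Step 3: accumulate restrictions. */
        r.restr = (uint8_t)(r.restr | c->restr);

        if (c->type != CT_GPT) {             /* Page or CapPage: done */
            r.page_id = c->page_id;
            r.offset  = v;
            return r;
        }

        /* Step 2, then descend into the selected slot. */
        if (u >= GPT_SLOTS) {
            r.status = TR_INVALID_ADDRESS;
            return r;
        }
        va = v;
        c  = &c->gpt->cap[u];
    }
    r.status = TR_MALFORMED_SPACE;           /* step limit reached */
    return r;
}
```

For a root GPT capability with guard 3, l2g = 16, and l2v = 12 whose slot 5 holds a page capability, the address 0x35abc resolves to that page at offset 0xabc, while 0x36abc selects an empty (Null) slot and is reported as an invalid address.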

4.3.2 Exception Handling

If an exception is generated by the translation mechanism, and the memory exception handler capability Chandler is not Null, then the exception will be delivered to the memory exception handler. Otherwise, the exception type and address are stored in the process's faultCode and faultInfo slots, respectively, and the process is set running with the pending fault code, and the exception is then delivered as described in Section 3.3.1.

4.3.3 Cycle Detection

It is possible for an erroneous or hostile program to arrange GPT objects in such a way as to create a static cycle. Such an address space is malformed, and attempts to traverse such a cycle during address translation result in a MalformedSpace exception.

No final selection has been made for a method of cycle detection. Three rules have been proposed:

  1. A bound on the total number of GPT structures that will be visited before generating a MalformedSpace exception.

    This method has been rejected. It has the unfortunate property that existing, valid addressing structures can be rendered invalid by ``splitting'' an existing GPT. We want to preserve the ability to split without semantic alteration in order to be able to map subspaces.

  2. A bound on the total number of capabilities that do not translate new bits that will be visited before generating a MalformedSpace exception.

    This method keeps track of |vleast|, the shortest virtual address that has been obtained by translation to the current point. If C.l2v≥|vleast|, then the current capability does not translate new bits.

    This method has been rejected. It has the unfortunate property that existing, valid addressing structures can be rendered invalid by ``splitting'' an existing GPT. We want to preserve the ability to split without semantic alteration in order to be able to map subspaces.

  3. A bound on the total number of bits visited for translation, defined as the cumulative sum of (|va|-|v|) for all capabilities visited during a translation attempt.

    This approach preserves the possibility of a correctness-preserving split operation.

All methods of cycle detection introduce a complication for implementers: the validity of addresses within a subspace is contextually dependent on the number of bound-countable events in the prefix path leading to that subtree. This means that two process address spaces may both have some subspace mapped at otherwise valid subspace addresses, and selected subranges of the mapped subspace may nonetheless be valid in one space but not in the other.

Because of this problem, care must be taken when implementing page table sharing to ensure that page tables are shared only when all possible references through that hardware table are equally valid in all referencing contexts. If this is not done, one process would be able to produce valid mappings in the hardware mapping table that would be usable by the second, even though the second lacks the ability to produce those hardware mappings for itself.

4.4 Address Space Splitting

Experimental

The feature described in this section is experimental. It is not presently implemented, and may be removed in future versions of Coyotos.

In order to support the subspace transfer item described in the capability invocation chapter, Coyotos introduces a new type of exception that may occur in an address space: the SplitFault.

Split faults allow an invoker to send a single capability to an arbitrary 2^k page region of an address space, provided that the region is naturally aligned and the invoker has sufficient access rights to extract the dominating capability. Similarly, they permit a receiver to generate appropriate ``holes'' into which such a capability must be received.

The problem solved by split faults is that there may not be any naturally dominating GPT for the subspace. For example, in a system having 4 kilobyte (2^12 byte) pages, the invoker may wish to transmit a 2^11 page (2^23 byte) subspace, but the subspace may currently be dominated by a GPT having l2v=21. That is: there is no single slot in the GPT that directly holds a capability of the desired span. Before a single dominating capability can be sent, this GPT must be ``split'' into an arrangement where the target subtree has a single dominating GPT with l2v=23. When such a send is attempted, the invoker will receive a SplitFault exception. This is an advisory that the GPT must be split in order to bring a dominating GPT into existence.
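
The arithmetic behind this example can be checked with a short sketch. The function names are hypothetical, and the sketch only computes sizes; it says nothing about how a split would actually be performed. It assumes the difference between the two exponents is less than 64.

```c
#include <stdint.h>

/* log2 of the subspace size in bytes, from log2 of the page count
 * and log2 of the page size. */
static unsigned subspace_l2_bytes(unsigned l2_pages, unsigned l2_page_bytes)
{
    return l2_pages + l2_page_bytes;
}

/* Number of slots of a GPT with the given l2v that a naturally aligned
 * subspace of 2^l2_bytes bytes occupies. A result greater than 1 means
 * no single slot of that GPT holds a capability of the desired span. */
static uint64_t slots_spanned(unsigned l2_bytes, unsigned gpt_l2v)
{
    if (gpt_l2v >= l2_bytes)
        return 1;
    return UINT64_C(1) << (l2_bytes - gpt_l2v);
}
```

For the example in the text, a subspace of 2^11 pages of 2^12 bytes has log2 size 23, and under a GPT with l2v=21 it occupies 2^(23-21) = 4 adjacent slots, which is why no single capability dominates it.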

Similarly, if a receiver specifies a ``hole'' of some size 2^hlsz pages, there must exist some GPT in the receiver tree that could receive (with an appropriate guard value) a capability dominating a tree of the requested size.

The reason this feature is considered experimental is that the correct strategy for splitting GPTs is not obvious.

The address space splitting idea is not yet fully developed. There are certainly holes, including necessary but undefined exception types, that need to be resolved in the definition above.

Chapter 5: Capability Invocation (including IPC)

Coyotos is an object-based system. A process wishing to perform an operation (equivalently: invoke a service) does so by invoking some capability that it holds. The capability has a defined interface that specifies some set of invokable methods, including their argument and return types. The provider of these methods may be either the kernel or an application; the invocation mechanism is the same in either case. That is: Coyotos is an extensible object system [16]. The primary system call in Coyotos is the ``invoke capability'' system call (Section 6.5). Other system calls defined by the Coyotos specification may all be viewed as convenience wrappers for capability method invocations.

Because kernel-provided and application-provided services share a common invocation mechanism, it is necessary to specify both the low-level binding of capability interfaces and the externally observable semantics of capability invocation. While the specific binding is architecture-dependent, this chapter includes recommendations on bindings that suffice for most platforms.

The invoke capability system call implements a variant of the SendAndWait primitive proposed by Liedtke [12] or the call and return primitives of EROS [4]. The send phase of the invocation can be blocking or non-blocking. If a non-blocking send is performed, some or all of the message may be truncated. The receive phase may wait for an arbitrary endpoint (an ``open wait'') or a specific endpoint (a ``closed wait''). The receive phase is optional.

5.1 Invocation Payload

An invocation passes a message that consists of:

  • Up to 8 direct words, the first of which is the invocation control word. The size of these words is architecture dependent. These words may be carried in registers or memory, as specified by the architecture-specific annex for the target platform. The index of the last word transmitted is given by IPR0.ldw.

    Input parameter word 0 of the invoke capability operation contains control information describing the rest of the message payload:

    Figure 11. Invocation control word (input)

    Provided the invokee is valid and well-formed, a message consisting solely of untyped parameter words is guaranteed to proceed without exceptions on all architectures.

  • Up to four capabilities. Capabilities are transmitted if IPR0.SC=1. If so, IPR0.lsc gives the index of the last capability transmitted. Capabilities are received if IPR0.AC=1. If so, IPR0.lrc gives the index of the last capability that will be accepted.

  • An indirect string of up to 64 kilobytes. The length of this string is given in a parameter word to the system call.

In addition to the payload of the invocation, the invoker specifies:

  • The capability to be invoked.

  • Whether they are willing to block in order for the message to be delivered (IPR0.NB).

  • Whether to fabricate a reply capability (IPR0.RC).

  • Whether the receive phase should be performed (IPR0.RP).

  • Whether copy-out of soft registers should be performed on those architectures that define soft registers (IPR0.CO).

  • Whether the receive phase should accept messages only from a particular endpoint (IPR0.CW, upcb.rcvEpID).
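The fields named above can be collected into a plain C structure for illustration. This is only a sketch: the bit positions are given by Figure 11 and are not modeled here, so each field is kept as a separate member. The names `icw_fields`, `icw_word_count`, and `icw_in_bounds` are hypothetical, and the sketch assumes that ldw, lsc, and lrc are zero-based indices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DIRECT_WORDS 8u  /* "up to 8 direct words"     */
#define MAX_CAPS         4u  /* "up to four capabilities"  */

/* Hypothetical unpacked view of the invocation control word (IPR0).
 * Bit positions are given by Figure 11 and are not modeled here. */
struct icw_fields {
    unsigned ldw;  /* index of the last direct word transmitted     */
    unsigned lsc;  /* index of the last capability sent (if sc)     */
    unsigned lrc;  /* index of the last capability accepted (if ac) */
    bool sc;       /* send capabilities                             */
    bool ac;       /* accept capabilities                           */
    bool nb;       /* non-blocking send                             */
    bool rc;       /* fabricate a reply capability                  */
    bool rp;       /* perform the receive phase                     */
    bool co;       /* copy out soft registers                       */
    bool cw;       /* closed wait on a particular endpoint          */
};

/* Number of direct words carried, counting the control word itself
 * (assumes ldw is a zero-based index). */
unsigned icw_word_count(const struct icw_fields *f)
{
    assert(f != NULL);
    return f->ldw + 1;
}

/* Checks the index fields against the limits quoted above. The lsc and
 * lrc fields are only examined when the matching control bit is set. */
bool icw_in_bounds(const struct icw_fields *f)
{
    assert(f != NULL);
    if (f->ldw >= MAX_DIRECT_WORDS) return false;
    if (f->sc && f->lsc >= MAX_CAPS) return false;
    if (f->ac && f->lrc >= MAX_CAPS) return false;
    return true;
}
```

A kernel would report an out-of-range field as a MalformedSyscall exception (Section 6.2) rather than return a boolean; the predicate form is used here only to keep the sketch short.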

If a receive phase is executed, the receiver receives the following information in addition to the invocation payload:

  • The endpoint identifier of the Endpoint on which the invocation was received.

  • The ``protected payload'' of the capability that was invoked.

  • The length of the string that was sent, if any.

  • A modified copy of the invocation control word, which indicates various information about the incoming message. In this returned word, the u, RC, and SC fields are copied from the sender's input invocation control word. The lsc field indicates the number of capabilities that have been received. The ldw field indicates the number of data words that have been received.

The protected payload and the endpoint ID can be used to determine the receiver-defined context in which the received message should be interpreted. One common use of these fields is for the endpoint ID to identify the object invoked and the protected payload to identify the permissions on that object.
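The common use described above can be sketched as a server-side lookup. Everything in this example is hypothetical: the object table, the permission bit assignments, and the names `incoming` and `lookup` are illustrative and are not part of the Coyotos interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits that a server stores in the protected
 * payload of the Entry capabilities it hands out. */
enum { PERM_READ = 1u << 0, PERM_WRITE = 1u << 1 };

/* Hypothetical server-side object table, indexed by endpoint ID. */
struct server_object { const char *name; };

static struct server_object objects[] = { { "log" }, { "config" } };

/* The receiver-visible context of an incoming invocation. */
struct incoming {
    uint64_t endpoint_id;   /* identifies the object invoked     */
    uint32_t prot_payload;  /* identifies the permissions held   */
};

/* Returns the target object, or NULL if the endpoint ID is unknown or
 * the protected payload lacks a needed permission bit. */
struct server_object *lookup(const struct incoming *in, uint32_t needed)
{
    assert(in != NULL);
    size_t count = sizeof objects / sizeof objects[0];
    if (in->endpoint_id >= count) return NULL;
    if ((in->prot_payload & needed) != needed) return NULL;
    return &objects[in->endpoint_id];
}
```

Because both values are supplied by the kernel rather than by the sender, a server written this way does not need to trust anything in the message body to decide which object is addressed or what the caller may do to it.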

5.2 Invocation-Related Exceptions

Exceptions may occur during invocation on either the sender or the receiver side of the transmission. All such exceptions logically occur before the invocation. In practice, exceptions are generated as a consequence of payload transfer. If an exception occurs, the implementation is free to resume the transfer at the point of interruption if it is able to do so. However, the receiver of an interrupted transmission logically reverts to the beginning of its receive phase when an exception occurs. In the event that a second sender is attempting to send when a messaging exception is incurred, the second sender's message may prevail.

If the sender specifies non-blocking transmission, the transfer of indirect strings and capabilities is ``best effort.'' If the receiver incurs a page fault during the receipt of an indirect string or a capability argument, that argument will be truncated. In this case the receiver will be notified of truncation, but no receiver-side exception will be generated.

The meaning of a non-blocking send is that the sender is unwilling to be blocked for any cause whose handling is controlled by the receiver. The use-case for this option is a server returning a reply to an untrusted client. For purposes of understanding truncation, a hardware page fault that is successfully resolved by the object paging subsystem is not considered to be an architecturally observable fault. Similarly, an exception that can be satisfied by reconstructing a hardware mapping entry from an already defined GPT hierarchy is not considered an architecturally observable fault.

5.3 Endpoints

A process that wishes to accept capability invocations does so by means of one or more endpoints (Figure 5.2). Endpoints have two capability types: the Endpoint capability, which implements the control interface for the endpoint object, and the Entry capability, which provides the means for extending the object system. When an Entry capability is invoked, the invocation parameters are delivered as a message to the process named by the recipient field of the Endpoint.

Figure 12. Endpoint structure

The meanings of the endpoint fields are:

Field          Meaning
pm             Payload Match. Indicates that the protected payload of the endpoint should be compared to the protected payload of the Entry capability. If they are not equal, the invocation behaves as if the Null capability had been invoked.
protPayload    Protected Payload. This value will be conditionally used as a matching value if PM is set (1).
endpointID     A 60-bit field having meaning only to the recipient. The value of this field will be delivered to the recipient during message receive.
recipient      A process capability to the receiving process.
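The payload match rule can be illustrated with a small model. This is a sketch, not the kernel's data layout: `struct endpoint`, `struct entry_cap`, and `entry_delivers` are hypothetical names, and the recipient is represented by a plain integer standing in for a process capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the endpoint fields. */
struct endpoint {
    bool     pm;            /* Payload Match                       */
    uint32_t prot_payload;  /* Protected Payload                   */
    uint64_t endpoint_id;   /* 60 significant bits, recipient-only */
    int      recipient;     /* stand-in for a process capability   */
};

/* Hypothetical view of an Entry capability that names an endpoint. */
struct entry_cap {
    const struct endpoint *ep;
    uint32_t prot_payload;
};

/* Returns true if invoking `cap` delivers a message to the recipient,
 * false if the invocation behaves as if the Null capability had been
 * invoked (PM set and the two protected payloads differ). */
bool entry_delivers(const struct entry_cap *cap)
{
    assert(cap != NULL && cap->ep != NULL);
    if (cap->ep->pm && cap->ep->prot_payload != cap->prot_payload)
        return false;
    return true;
}
```

When pm is clear, the two payload values are not compared, which is what lets a receive endpoint hand out Entry capabilities with many different protected payloads.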

Non-Normative Illustration

When a server implements a single logical object, it will typically operate with two endpoints. The first is the one used to invoke the service (the ``receive endpoint''). The endpoint ID of this endpoint is not used. The protected payload of the corresponding Entry capability may be used to express distinct permissions or restrictions on the permitted operations. The second endpoint is used to accept replies (the ``reply endpoint''). The endpoint ID of this endpoint is used as a matching value to implement a closed wait so that unrelated messages are not received where a reply is expected. The endpoint's protected payload is used to ensure that no more than one reply will be received (by means of the IPR0.RC bit of the invoke capability system call).

When a server implements multiple objects, a distinct receive endpoint is typically allocated for each object implemented by the server. In this case, the endpoint ID is used to identify which object or service is being invoked, and the protected payload field of the corresponding Entry capability is used to express distinct permissions or restrictions on the permitted operations on that object.

Non-Normative Note on Reply Endpoints

If the IPR0.RC bit is set in the invocation control word parameter, the protected payload of the endpoint is pre-incremented before the Entry capability is fabricated. The purpose of the RC bit is to allow a caller to ensure that a call/return sequence receives at most one reply in the normal case. This is accomplished by ensuring that stale reply capabilities are invalidated (by protected payload mismatch) before the next receive on the reply endpoint is performed.

Whenever an Entry capability is invoked, the invokee receives the protected payload value of the invoked Entry capability. In the case of a reply endpoint, the PM bit is set, so the received protected payload value matches the value stored in the endpoint.

It is the responsibility of the application to notice when the incoming protected payload value approaches UINT32_MAX. In this situation, the pre-increment will overflow the protected payload counter when it is next used. The recommended solution for this is to obtain a new reply endpoint from a space bank when the protected payload reaches UINT32_MAX-1.
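The interaction between the RC bit, payload matching, and counter overflow can be modeled in a few lines. This is a minimal sketch under the assumption that the protected payload is a 32-bit unsigned counter (as the reference to UINT32_MAX suggests); the names `reply_endpoint`, `fabricate_reply`, `reply_is_live`, and `should_replace_endpoint` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a reply endpoint whose PM bit is set. */
struct reply_endpoint { uint32_t prot_payload; };

/* Models the effect of IPR0.RC=1: the endpoint's protected payload is
 * pre-incremented, and the fabricated reply capability carries the new
 * value. Unsigned arithmetic wraps to 0 after UINT32_MAX. */
uint32_t fabricate_reply(struct reply_endpoint *ep)
{
    assert(ep != NULL);
    ep->prot_payload += 1;
    return ep->prot_payload;
}

/* With PM set, a reply capability is usable only while its payload
 * still equals the payload stored in the endpoint. */
bool reply_is_live(const struct reply_endpoint *ep, uint32_t cap_payload)
{
    assert(ep != NULL);
    return ep->prot_payload == cap_payload;
}

/* Application-side check: time to obtain a fresh reply endpoint? */
bool should_replace_endpoint(uint32_t incoming_payload)
{
    return incoming_payload >= UINT32_MAX - 1;
}
```

Each call to `fabricate_reply` invalidates every reply capability issued earlier, which is how a stale reply is turned into a Null invocation instead of a second delivery.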

5.4 Semantics of Kernel Capability Invocation

To ensure consistent invocation behavior, it is necessary to specify the externally observable behavior when a kernel-implemented method is invoked. In particular, the observable effect on process state and the sequencing of operations and events during a kernel invocation must be defined.

When a kernel capability is invoked, the externally observable behavior should be as if the invoker had invoked an endpoint to some application providing the service. Because no kernel operation accepts an indirect string, the invocation of a kernel capability behaves as if this hypothetical provider had performed a receive phase with IPR0.AS=0 (no strings will be accepted). This hypothetical provider arrives at the specified answer and accomplishes any effects of the invocation by unspecified means. It then replies as if it had invoked the InvokeCap system call with the control bits of the first input parameter word set as follows: NB=1 (non-blocking), RC=0 (no reply capability is generated), CW=0 (the kernel conceptually enters an open wait state), and RP=1 (the kernel waits for the next invocation). In addition, the SC (send capabilities) control bit will be set (clear) if capabilities are (are not) returned by the method. Note that because the kernel reply is non-blocking, and the kernel is deemed to be in the running state until it has replied, the reply from a kernel-implemented capability cannot cause a second kernel-implemented capability to be invoked.
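The control bits of this conceptual kernel reply can be written down as a small function. This is an illustrative sketch only; `struct reply_ctl` and `kernel_reply_ctl` are hypothetical names, and the structure records the bits as booleans rather than in their encoded positions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of the control bits used in the kernel's reply. */
struct reply_ctl {
    bool nb;  /* non-blocking send             */
    bool rc;  /* fabricate a reply capability  */
    bool cw;  /* closed wait                   */
    bool rp;  /* perform a receive phase       */
    bool sc;  /* send capabilities             */
};

/* Control bits of the reply for a kernel method that returns
 * `caps_returned` capabilities (an invocation carries at most four). */
struct reply_ctl kernel_reply_ctl(unsigned caps_returned)
{
    assert(caps_returned <= 4);
    struct reply_ctl c = {
        .nb = true,               /* NB=1: the kernel never blocks    */
        .rc = false,              /* RC=0: no reply capability        */
        .cw = false,              /* CW=0: open wait                  */
        .rp = true,               /* RP=1: wait for next invocation   */
        .sc = caps_returned > 0,  /* SC follows the method's results  */
    };
    return c;
}
```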

This statement of behavior has (at least) the following implications:

  • The effects of a kernel capability invocation occur whether or not the invocation returns successfully, provided any preconditions specified for the method are satisfied.

  • There exist several kernel operations that alter the state of a process. When the process altered is also the process receiving the kernel reply (the ``invokee''), the kernel behavior must be well-defined. There are two such cases:

    1. The invokee process is destroyed as an effect of the invoked method. In this case, the reply proceeds as if via an endpoint that contains a Null capability.

By intention, kernel-implemented operations satisfy two invariants that simplify or eliminate other potentially obscure corner cases:

  • No kernel-implemented interface accepts or returns an indirect string.

  • Kernel methods that modify address space mappings or revoke objects return only scalar return values (and therefore behave as if SC=0). This ensures that changes in the meaning of the receive capability capitem_t values cannot impact the return of these operations.

Undefined Locations    The content of receive buffers, receive parameters, and receive capability locations is undefined between the start of the IPC receive phase and the completion of the IPC receive phase. For performance reasons, the kernel is entitled to arbitrarily modify state whose content is undefined during invocation. In particular, the kernel is entitled to modify the receive parameters or the receive string buffers of a waiting process without releasing that process from its wait state. This allows the kernel to more efficiently implement indirect string moves that may induce invoker or invokee page faults during the transfer. This means that all of the receive string buffers of a recipient may be modified during receive, even if the final message received sends only a single indirect byte. Similarly, any valid receive capability locations may be overwritten even if a smaller number of capabilities was transferred.

State Transitions    The overwhelming majority of kernel capability invocations return to the invoker without generating any exception. In these cases, the kernel may behave as if the operation occurred instantaneously, with the consequence that the invoker may never be observed to leave the running state.

Elided Reply Capability    When a kernel capability is invoked and replies to the invoker without an exception, the kernel implementation is free to elide the fabrication of the reply capability. Elided reply capabilities are observable because the protected payload value of the reply endpoint will not be incremented.

Chapter 6: System Calls

Coyotos currently defines three system calls:

Number    Name         Description
0         InvokeCap    Invokes a capability and (optionally) waits for a reply on an endpoint.
1         reserved     Reserved for future use.
2         CopyCap      Copy a capability from one location to another.
3         Yield        Yield the processor, moving the current process to the back of its scheduling class.
4..15     reserved     Reserved for future use.
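The system call numbers in the table above can be expressed as a C enumeration. The identifiers below are hypothetical, and treating a number above 15 as a caller error is an assumption of this sketch, since the table only covers 0 through 15.

```c
#include <assert.h>
#include <stdbool.h>

/* System call numbers from the table; identifier names are illustrative. */
enum syscall_number {
    SC_INVOKE_CAP = 0,
    SC_COPY_CAP   = 2,
    SC_YIELD      = 3,
    SC_NUMBER_MAX = 15   /* highest number listed in the table */
};

/* True for the three defined calls, false for reserved numbers. */
bool syscall_is_defined(unsigned n)
{
    assert(n <= SC_NUMBER_MAX);  /* sketch assumption: table range only */
    return n == SC_INVOKE_CAP || n == SC_COPY_CAP || n == SC_YIELD;
}

/* Name of the call, or "reserved" for any unassigned number. */
const char *syscall_name(unsigned n)
{
    assert(n <= SC_NUMBER_MAX);
    switch (n) {
    case SC_INVOKE_CAP: return "InvokeCap";
    case SC_COPY_CAP:   return "CopyCap";
    case SC_YIELD:      return "Yield";
    default:            return "reserved";
    }
}
```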

6.1 Parameters and Parameter Words

At the system call trap interface, arguments and return values are conveyed by means of a combination of registerized parameter values and (optionally) a system-call specific stack frame. Every architecture-specific annex specifies a subset of hardware registers that are used to convey system call parameters. No annex defines fewer than four registers to be available at this interface. Where a specialized stack frame is specified, the architecture-specific annex may specify that some or all of that stack frame is conveyed across the user/supervisor boundary in registers. The corresponding fields of the stack frame will never be accessed by the kernel.

The Coyotos system call specification ensures that all arguments and return values of system calls other than InvokeCap can be marshalled in registers. In addition, the majority of kernel-implemented capabilities, including all capabilities likely to be invoked in performance-critical application paths, can be invoked without a string parameter.

In the system call specifications that follow, the notations IPRn and OPRn indicate input and output parameter registers, respectively. Except where required by the architecture-specific system call mechanism, or explicitly noted by the system call, output registers retain their value at the time of system call entry.

6.2 Exceptions

The following exceptions may be incurred by the caller during system call execution.

Exception              Cause
MalformedSyscall       The operand was malformed. This includes field value range errors or reserved type codes. The faultInfo field is zero.
MisalignedReference    The operand specified a capability address, but the address described is not aligned to a 16-byte boundary. The faultInfo field contains the errant address value.
InvalidAddress         The operand specified an address that is not defined. The faultInfo field contains the errant address value.
AccessViolation        A store operation was attempted, but the operand specified an address that does not permit write access. The faultInfo field contains the errant address value.
DataAccessTypeError    The address specified by a data load/store operand does not reference a data page. The faultInfo field contains the errant address value.
CapAccessTypeError     The address specified by a capability load/store system call does not reference a capability page. The faultInfo field contains the errant address value.
MalformedSpace         The address specified by the operand violated the well-formed address space constraints. The faultInfo field contains the errant address value.

Any system call may generate the MalformedSyscall exception if bits marked ``reserved'' are non-zero or specified field value bounds are exceeded. Individual system call descriptions below specify which of the other exceptions may be incurred by that system call.
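The pairing of an exception with its faultInfo value can be sketched for the one check that needs no address space state, the 16-byte alignment of a capability address. The enumeration values, `struct fault`, and `check_cap_address` are hypothetical; the specification names the exceptions but this section does not give their numeric codes.

```c
#include <assert.h>
#include <stdint.h>

/* Exception names from the table above; numeric values are illustrative. */
enum exception_code {
    EXC_NONE = 0,
    EXC_MALFORMED_SYSCALL,
    EXC_MISALIGNED_REFERENCE,
    EXC_INVALID_ADDRESS,
    EXC_ACCESS_VIOLATION,
    EXC_DATA_ACCESS_TYPE_ERROR,
    EXC_CAP_ACCESS_TYPE_ERROR,
    EXC_MALFORMED_SPACE
};

#define CAP_ALIGN ((uintptr_t)16)  /* capability addresses: 16-byte aligned */

/* The mask test below is only valid for a power-of-two alignment. */
static_assert((16 & (16 - 1)) == 0, "CAP_ALIGN must be a power of two");

/* Hypothetical fault record: the exception plus its faultInfo value. */
struct fault {
    enum exception_code code;
    uintptr_t fault_info;
};

/* Alignment check for a capability address operand. A misaligned
 * address yields MisalignedReference with the errant address as
 * faultInfo; otherwise no fault is reported by this check. */
struct fault check_cap_address(uintptr_t addr)
{
    struct fault f = { EXC_NONE, 0 };
    if ((addr & (CAP_ALIGN - 1)) != 0) {
        f.code = EXC_MISALIGNED_REFERENCE;
        f.fault_info = addr;
    }
    return f;
}
```

The remaining address-related exceptions depend on the translation of the address and on the type and permissions of the page found, so they cannot be decided from the operand value alone.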

6.3 Capability Locations

Figure 13. caploc_t structure

A caploc_t parameter (Figure 13) describes a generalized capability location that is either a capability register (ty=0) or a memory address (ty=1).

The encoding of register caploc_t values is identical to the encoding used for capreg_t values, modulo the wider location field. When the caploc_t describes a memory address, the location field holds the most significant bits of the address. Capability addresses are required to be 16-byte aligned. In consequence, the expression of valid capability addresses is not restricted by the re-use of the least significant bit for this purpose.

The size of a caploc_t matches the architecture-defined word size.
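The encoding described in this section can be sketched as follows. The sketch assumes that ty occupies the least significant bit and that the location field occupies all remaining bits, so that a register number is stored shifted left by one; the exact field boundaries are those of Figure 13. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A caploc_t matches the architecture-defined word size. */
typedef uintptr_t caploc_t;

#define CAPLOC_TY_MEM ((caploc_t)1)  /* assumed: ty is bit 0, 1 = memory */

/* Encode a capability register number (ty=0). */
caploc_t caploc_from_reg(unsigned reg)
{
    return (caploc_t)reg << 1;
}

/* Encode a capability address (ty=1). The address must be 16-byte
 * aligned, so its low bit is free to carry the ty flag. */
caploc_t caploc_from_addr(uintptr_t addr)
{
    assert(addr % 16 == 0);
    return (caploc_t)addr | CAPLOC_TY_MEM;
}

bool caploc_is_mem(caploc_t loc)
{
    return (loc & CAPLOC_TY_MEM) != 0;
}

/* Recover the address from a memory caploc_t by clearing the ty bit. */
uintptr_t caploc_addr(caploc_t loc)
{
    assert(caploc_is_mem(loc));
    return (uintptr_t)(loc & ~CAPLOC_TY_MEM);
}

/* Recover the register number from a register caploc_t. */
unsigned caploc_reg(caploc_t loc)
{
    assert(!caploc_is_mem(loc));
    return (unsigned)(loc >> 1);
}
```

Because the ty bit overlays a bit that is always zero in a valid capability address, no valid address becomes unrepresentable.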

6.4 Pseudo-Instructions

The Yield and CopyCap system calls are best thought of as pseudo-instructions.

6.4.1 Yield [syscall]

The Yield system call relinquishes the processor. If the I bit is clear (0), the yielding process is placed at the end of the appropriate ready queue.

Parameter Format
IPR0

The Yield system call does not return any output parameters.

Open Issue

Should there be a directed yield operation? If so, the yield system call needs to take a second parameter.

6.4.2 CopyCap [syscall]

Parameter Format
IPR0
IPR1
IPR2

The CopyCap system call copies a capability from a source location (register or memory address) to a target location (register or memory address). The CopyCap system call does not return any output parameters. Exceptions may be generated by the references to the source and dest parameters.

Any of the exceptions listed in Section 6.2 other than DataAccessTypeError may be generated by this system call.

6.5 InvokeCap [syscall]

The InvokeCap system call invokes a capability, passing the supplied parameters to the implementing server. It is both the most complex and the most commonly used system call in the interface. It invokes one capability and optionally blocks for an incoming message on an Endpoint. InvokeCap takes a variable number of parameters determined by the invocation control word provided in IPR0 and a system-call specific