[coyotos-dev] Thoughts on FCRBs and asynchrony

Jonathan S. Shapiro shap at eros-os.org
Thu Feb 2 11:19:47 EST 2006


I only want to reply very quickly to a couple of details. Charlie's note
deserves a serious reply, but I have a hard deadline for something
today.

On Wed, 2006-02-01 at 20:46 -0800, Charles Landau wrote:

> > kernel threads are not cheap
> 
> The correct statement to make here is "Pentium kernel threads are not
> cheap". No doubt this is true for other architectures too (Sun comes
> to mind), but there are some common architectures with more reasonable
> state size. Granted, Pentium is an important case to handle.

I agree that the overview is not making the case well yet, but I don't
think that this statement is correct either.

The following processors have large register states because of the
presence of vector processing and/or floating point units:

  Pentium family, including AMD
  Itanium
  SPARC
  ARM versions with the Neon extensions
  MIPS processors having vector extensions
  Coldfire
  PowerPC

So I'm not clear which architecture *doesn't* have a problem these days.
Particular processor *implementations* that omit features may not have
such issues, but all of the active architecture *families* that I know
about do have them.

However, even if we completely ignore register state size, we must still
address the following issues with multiple kernel threads:

  Requirement for a per-kernel-thread stack (not needed with FCRBs if an
    event loop design is used)
  Per-kernel-thread capability registers

This is a substantial amount of state. While it doesn't all need to be
unloaded and reloaded, it still occupies real memory and creates an
impediment to scalability.
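
To put a rough number on this, here is a back-of-the-envelope sketch. The 4KB register state and the 1024-thread server are the figures used elsewhere in this thread; the 8 KiB per-thread stack is purely an assumption for illustration, and capability registers are left out entirely.

```python
KIB = 1024

def per_thread_footprint(threads, reg_state_bytes, stack_bytes):
    """Memory held by per-thread state, whether or not a thread ever runs."""
    return threads * (reg_state_bytes + stack_bytes)

# 1024 threads, 4 KiB register state each, plus an ASSUMED 8 KiB stack each.
total = per_thread_footprint(threads=1024,
                             reg_state_bytes=4 * KIB,
                             stack_bytes=8 * KIB)
print(total // KIB, "KiB resident for thread state alone")
```

Under these assumptions the stacks cost twice as much memory as the register state, which is why the register-size argument alone does not settle the question.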

> If the thread size is 4KB, it's not only the space, but the context
> switch time that has to be a concern. I believe the Coyotos design is
> a recognition that thread state is large and increasing and the
> asynchronous architecture is a way of reducing both the number of
> threads and the number of context switches.

I hope so, but I'm uncertain about the context switch issue. I think
that the switch time can be reduced, but if you have multiple user-level
threads of control that are actually using all of that state you are
going to have to move it one way or the other.

What I think we *will* likely be able to reduce is the amount of
blocking delay incurred by senders.

We should also be able to support certain cases that don't happen in
many programs, but are important when they do happen: preemption and
non-blocking notification.

> I believe Shap has created a new object or objects (FCRB) for this
> (though I still don't follow the details of that design). FCRB's are
> smaller than Pentium threads, so that is a good thing. I believe this
> is an increase in complexity, but it appears justifiable. I do wonder
> if there mightn't be a simpler design that meets the requirement.

Let me attempt a simplified description of FCRBs. I should probably
re-work some of this into the document.


Oversimplified FCRBs:

The FCRB doesn't hold the state. The FCRB holds a pointer to an extended
receive block descriptor (ERB) and a capability to the receiving
process. The ERB says where the message goes. The ERB corresponds to the
KeyKOS "exit block".
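
As a toy model only (this is not Coyotos code, and every name in it is hypothetical), the oversimplified picture might be sketched like this. The truncation behaviour in `deliver` is an assumption of the sketch, not something the design specifies here.

```python
from dataclasses import dataclass

@dataclass
class ERB:
    """Extended receive block descriptor: says where the message goes."""
    buffer: bytearray  # hypothetical destination for the payload

@dataclass
class Process:
    name: str

@dataclass
class SimpleFCRB:
    """Holds no message state: just an ERB reference and the receiver."""
    erb: ERB
    receiver: Process  # stands in for a capability to the receiving process

def deliver(fcrb: SimpleFCRB, payload: bytes) -> int:
    """Copy the payload to wherever the ERB directs; return bytes copied.

    ASSUMPTION: an oversized payload is truncated to the buffer length.
    """
    n = min(len(payload), len(fcrb.erb.buffer))
    fcrb.erb.buffer[:n] = payload[:n]
    return n
```

The point of the sketch is only that the FCRB itself stays small: the message lands in receiver-designated storage, never in the FCRB.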

Real-World FCRBs:

An optimization is introduced so that a small number of registers can be
delivered directly to receiver registers. Logically there is space for
these registers in the FCRB in case the FCRB cannot be delivered
immediately. There is also space for them in the ERB so that the
activation handler has a place to store them. When the FCRB is later
delivered, these values are delivered to the application in registers.

This optimization has two consequences:

  1. In the usual case (immediate delivery), the registers will never
     get stored into the FCRB, because they will go immediately to 
     receiver registers.

  2. Since 50% of all messages consist entirely of a return code, these
     need not reference the receiver address space to examine the ERB.

The payload transfer mechanism is actually very similar to the
corresponding KeyKOS mechanism.
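
A minimal sketch of the register optimization, again as a toy model with hypothetical names. It covers only the immediate/deferred split described above; the ERB copy used by the activation handler is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Receiver:
    ready: bool
    registers: List[int] = field(default_factory=list)

@dataclass
class RegFCRB:
    receiver: Receiver
    saved_regs: Optional[List[int]] = None  # used only when delivery is deferred

def send(fcrb: RegFCRB, regs: List[int]) -> str:
    """Deliver a small register-only message through the FCRB."""
    if fcrb.receiver.ready:
        # Usual case: values go straight to receiver registers and the
        # FCRB's own register space is never written.
        fcrb.receiver.registers = list(regs)
        return "immediate"
    # Receiver cannot take the message now: park the values in the FCRB.
    fcrb.saved_regs = list(regs)
    return "deferred"

def deliver_pending(fcrb: RegFCRB) -> bool:
    """Later delivery: move parked values into receiver registers."""
    if fcrb.saved_regs is None or not fcrb.receiver.ready:
        return False
    fcrb.receiver.registers = fcrb.saved_regs
    fcrb.saved_regs = None
    return True
```

Note that neither path touches receiver memory, which is what makes the return-code-only case cheap.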


Activations:

One way to look at the activation stuff is that we are adopting the
activation mechanism in preference to the multiple waiting states
(available, waiting) of KeyKOS. Activations have two advantages:

  In addition to subsuming available/waiting, they can be used
  for other purposes that are critical in real-time applications.

  They can be used to do efficient user-level thread scheduling.

The second point may not be a feature depending on your philosophy of
design. I am not a fan of using threads, but they exist and they will be
used. It is better to place the burden of thread management on the
process rather than the kernel. Also, in certain restricted patterns
they can be used to great advantage: notably, they can be used to
convert asynchronous arrival into an internal event-driven processing
loop. This is one of the few threading patterns that we actually have a
very good handle on as builders.
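
The pattern I have in mind looks roughly like the following. This is an illustrative sketch, not the Coyotos activation interface; `on_activation` and `run_once` are made-up names.

```python
from collections import deque

class EventLoopProcess:
    """Turns asynchronous arrivals into a single event-driven loop."""

    def __init__(self):
        self.pending = deque()  # arrivals not yet processed
        self.handled = []       # processed events, in order

    def on_activation(self, event):
        """Hypothetical activation handler: record the arrival and return."""
        self.pending.append(event)

    def run_once(self):
        """Drain everything that has arrived, on one stack, in arrival order."""
        count = 0
        while self.pending:
            self.handled.append(self.pending.popleft())
            count += 1
        return count
```

All real work happens in `run_once`, so the process needs one stack no matter how many senders are outstanding.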

> Consider again that example server with 1024 threads. It needs
> separate threads because the network server(s) that it is
> communicating with demand to reply promptly (and the example server
> demands not to lose any replies)....
> (If prompt reply wasn't needed, the example server would give the
> network server start keys to a single thread with distinguished
> keydata. In that case the management of the keydata looks a lot like
> the management of the protected payload of an FCRB.)

Actually, prompt reply isn't the problem at all. The server isn't
speaking to network servers. It is performing simultaneous reads on 1024
TCP/IP connections. The majority of these connections are idle. I do not
see how to give start capabilities to the TCP/IP stack -- it is
unwilling to allocate the needed storage to hold these capabilities.

> If the problem were simply the cost of an extra IPC for forwarding,
> there are ways the kernel can optimize that. I know Shap has looked
> into this and found challenges in multiprocessor systems. I don't know
> whether the new design meliorates those challenges.

I was never able to figure out how to optimize this case successfully. I
was able to avoid redundant string transfer at the IPC layer, but not at
the cache coherency bus layer. Krieger's measurements of K42 and
previous work show that this can account for 40% or more of the cache
bus traffic, and that it is a very significant overall performance
penalty.

The new design addresses this issue by introducing a new layer of
kernel-managed indirection: receive queues. This is an additional and
undesirable complication, but I have not found a better approach.

> > Looking back, it now seems likely that none of us adequately
> > considered the issue of space or thought hard enough about
> > application-level complexity.
> 
> 
> Looking back, we did consider those issues, and resolved them
> differently because at that time thread size was much smaller, context
> switch times were relatively faster, and tolerance for complexity
> still had limits.

The "we" I had in mind was myself, Jochen, Bryan, and a few others who
were working in the '90s when the changes to architectures were already
clearly written on the wall. I apologize if the brush painted too
broadly, and I will look for a way to clarify this.

I still dispute the complexity part. In fact, I will wager that the
relevant pieces of Coyotos will be (on balance) as simple as the
corresponding pieces of CapROS (I pick CapROS because you have cleaned a
bunch of cruft out of EROS already). It may not be as fast. We will lose
some simplicity in register save. We will gain a fair amount in the slow
path.

And I *definitely* dispute the *tolerance* for complexity. At this
point, I think the complexity is in our heads rather than in the code.
This is a problem of description, and it is too early to indict the
design for lack of documentation. It has only existed for two weeks! A
simpler design that (by measurement) doesn't meet the needs of modern
machines and applications isn't simpler; it's insufficient. I would
welcome a variant of the EROS networking stack that operates within the
KeyKOS primitives, has competitive performance, and does not
re-introduce denial of resource or denial of control flow challenges.


shap


