|
22aug09
«
continuations
concurrency « In distributed systems, more than one thing happens at the same time. Nodes in a network do work out of sync with one another, but send messages to one another so progress in some distributed task can occur. Each node has the problem of keeping this straight. Representing this in code is notoriously complex, and there's more than one way to do it. Threads are a common solution approach, albeit a slightly lazy one: you can use a blocking api and simply have threads wait until future async events occur. But you can get several kinds of waste from this, including implicit throttling of concurrency from a maximum number of threads you allow. Or, instead of threads, you can just make async state machines driven by events when they arrive. I've worked on both sorts of system, and converted from one to another, in both directions. But here I'll focus on an async event approach. I can show you how to make threadsafe data structures—I'm good at it—using both locks and lockless techniques, but I won't. It's boring: there's a cut and dried way to do it. And the result is unsatisfying, because it doesn't scale easily. Why do I find threads less interesting? Is there something you can do with async events that threads cannot? For one thing, you can't simulate real systems very closely using threads, while sending events as messages is a way to model message propagation similar to what actually happens when devices signal one another. I think simulating real systems more accurately is a better way to experiment, with more informative results. Using threads you can write a node that responds to a distributed system, but it's hard to model the entire system itself, which I think you should do in complex systems if you pursue a scientific stance. 10oct09
«
tying knots
lane fibers « (I thought I was going to finish this when I started; but after two hours, I still have several things more to define, including msg M, lane L, cond C, mutex X, and the guts of zone Z, and those are the complex objects in this model. I guess I should do them tomorrow, instead of staying up into the middle of the night.) I recently wrote about a hand-rolling fibers in C++, which I dubbed lane-based fibers, aiming to define a fiber as a continuation in a channel called a lane (like a highway lane or swimming pool lane). In this model, a lane contains a frozen image of what a fiber is doing, consisting of little besides the continuation, an optional return value (if the continuation is an upward return), plus any bookkeeping for a lane engine (aka fiber VM, aka lane VM). For extreme brevity, say lanes to mean lanes-as-fibers. This piece today is about my efforts to get lanes to work. It's been going slowly. My code seesaws between elegance and complexity. When it gets complex, I think I'm doing it wrong. (Actually, I think something more like: what the hell am I doing?) Then I go through a round of simplification and say, oh, this is still cool enough. But now I grasp why folks don't take this sort of approach more often. In C++, lanes seem at least one order of magnitude more complex than the basic idea seems to require. C++ adds all this accidental detail obscuring the essential issues. I find that I must keep a lane model in my mind's eye because code hides too much of it. This seems wrong to me. At one point I suddenly realized lanes would be more clear in macro assembler than it is in C++. What's wrong with this picture? Apparently I may have to do this one or two more times before I like the end result. I'm having a "plan to throw one away" experience here. So far it looks like the first rev will be good enough to use, so it's not a waste of time. But I might be the only person who easily understands what the code does. That's not a good sign in code. Good code should clearly do what it purports to do, in a way readable to any fairly good programmer. What I have now is twisty. At one point I was thinking I might write another Lisp (called Lathe) directly using this C++ level async api with hand-rolled continuations. But so far it seems way more complex looking than ordinary synchronous code. That is, it looks more complex than what it's doing, which means the way I expressed it must be wrong. How should I have expressed it then? I'm thinking I should express what happens in a high level syntax—Lisp syntax, for example—and then generate the goofy parts, compiling high level syntax to C++ or C, which would simplify the basic expression, and remove some of the element of human error from translation to C++. But that begs a question of what first synchronous Lisp processor is used to bootstrap an async version. So now I think I ought to write a simple sync version of Lathe so I can use it to compile parts of the async version. But that basically amounts to a longish hobby project at home. I try not to get myself too easily enthused in longish hobby code projects now. But in this case, it seems like it's more likely to be worthwhile than things I aimed to do before. Why do I think that? Because what I'm seeing in the clunky version of lane-based fibers now looks promising despite the ugliness. It looks like it will be able to do something—async simulations of distributed systems—now currently very hard to do with tools at my disposal. And it has enough value. The rest of this is a kind of pseudo-code description of lane-based fibers. I want to write another view of it here that's simpler, so I can hold the image more clearly in my mind. Then if I get stuck in C++ again, I can doodle in notation like this to resolve an issue more quickly. A node, N, is a refcounted baseclass for all other refcounted objects (strongly resembling a yn node class in the node demo) supporting weak and strong refs, permitting a node to release resources when no longer used, before it's deallocated when no longer referenced. When a node N is created, you must supply a handle to hold the first ref, and a vat to free that node when it hits zero refs. Nodes can print themselves if you care to write the code. A handle, H, is a smart pointer to a node subclass. In template notation H<T> means a handle to type T, where T is a subtype of node N. To hide template notation, for each type T we might define typedef TH which means the same thing as H<T> (that is: a T handle.) In general, suffix H on a typename means a handle specific to the name's prefix. A generic node handle has type NH (aka H<N>). A main purpose of H is to ensure nodes are refcounted correctly, despite assignment, copy construction, and destruction along any C++ code path, including exception throwing and early exit via return in the middle of a method. (A secondary purpose of handles is to enable auditing to find refcount leaks on a backtrace by backtrace basis, if you write the audit code. The refcount of a node is defined to be a count of refs from handles only. You must use a handle to refcount, and nothing else can be used to refcount. Refcounting then becomes spatially oriented, based on where nodes are counted, rather than temporally oriented based on when counts change. Blame is only easy to audit in a spatial sense, with backtraces used as measure of spatial location. In node usage, memory safety is simply a matter of being able to see—in a local scope—at least one handle holding a node in a given context.) A vat, V, is a memory allocator specific to nodes, so when a node hits zero refs it can be returned to its source vat. Presumably this makes it easier to pool nodes and statically partition memory in fixed sized pools for nodes, if you're willing to tune a vat to fit a very precise app profile. (Otherwise a vat baseclass will simply use malloc() and free(), if you can't be bothered.) A pool, P, is another kind of memory allocator specific to ordinary memory blocks that aren't nodes. (A base pool class also uses malloc() and free(), so you need to subclass if you want something more complex to occur.) A jump, J, is a pointer to a function called by a lane fiber trampoline. From the perspective of a lane fiber, all functions have one and only one type signature, of type J. A jump method is a leaf call from a trampoline's view; inside a jump leaf, any synchronous C or C++ code can be called, but the lane engine doesn't know about this. Each jump called by engine's trampoline is inside the current lane's continuation, denoting the next thing a fiber does when the scheduler lets it run. The type of J is int(*J)(B*), taking a pointer to a nub and returning an integer of type yaw telling the trampoline where to veer next. A yaw integer, Y, has three possible values (negative, zero, positive) which denote exit, yield, and go resembling colors of a stoplight: red, yellow, and green. A negative yaw integer returned from any jump method J tells the trampoline to make the lane exit and return the last value to any listeners. A zero yaw tells the trampoline to yield the remainder of the current lane's timeslice. Postive yaw means a lane wishes to go further, timeslice permitting. A nub, B, is a triple (Z, Ai, Ao) denoting what any jump J needs to see. A zone Z (defined below) is all the runtime context of a lane fiber engine, providing access to anything a fiber needs to reach. An arc A (defined below) is the current fiber state of a lane, including code, stack, and optional return value. The input arc is Ai; the output arc is Ao. In effect, every jump J acts like a function of two arguments (Z, Ai) returning two values (Y, Ao) so Ai denotes what a lane should do now, and Ao denotes what a lane should do next when the scheduler calls the J inside Ao (if the return yaw integer is not negative for exit). (A nub B is a transient object defined on the stack in a trampoline method in the scheduler that actually calls each J method. The physical space for arcs Ai and Ao swap on successive calls, so Ao becomes the next Ai passed to the J extracted from Ai. The only reason Ai and Ao are two different arcs, instead of one inout parameter, is to minimize risk in updates under refcounting. If any J method treats Ai as mostly immutable, except for the stack inside, things should be good. Two arcs alternate to avoid extra refcount work.) To refer to an idea of continuations, I use several different short words with similar meanings to emphasize roles with minor differences. Since a continuation is viewed as an arrow denoting where a lane should go (like a vector in a point space), some of these words are synonyms for arrow. The exceptions are kont and knot, which both refer to continuation; a kont continuation uses a handle to refcount its frame, but knot does not, so knot is cheap, transient, binding of code and data without frame refcounting. The term way is a specialization of jump with direction, and term arc is a specialization of kont with payload added. A way, W, is a pair (J, S) where S is a C string representation of J for human readable presentations (e.g. in backtraces) which might include a longer type signature than J's uninformative minimalism. (For example, you might want to describe J's stack format, especially the input parameters which cannot appear in uniform J types.) In general, W gathers all metainfo about J we want to know. For now, a human readable C string is helpful on platforms (like the Mac) where it's hard to get the symbol string for a code address. (Think of way W as meaning the same thing as code, denoting which way a fiber goes next in its lane—direction without state or payload. Each way is now just a J function to call in the trampoline, plus a function name annotation so we can always render return addresses in backtraces.) A kont, K, is a continuation consisting of code address and data stack, binding together executable code with an activation frame. Each kont is a pair (W, FH) where FH is a frame handle (of type H<F>) holding a reference to an activation stack frame, where a frame F (defined below) is a node N subclass, so stack frames can be refcounted to build spaghetti stacks in the heap. In practice, most kont K instances are embedded either in A arcs or F frames, to represent either a lane fiber's arc frozen state of execution, or the return address and stack of a caller in a frame. A handle to a frame keeps that frame alive. A knot, Kn, is a proto-continuation pair (W, F) which is just like a kont pair, but without refcounting a frame. In practice, a knot is a very cheap, transient continuation used in dispatching without refcounting too soon. If you like, think of kont K and knot Kn as being exactly the same thing, but letting you choose whether cost of refcounting makes sense in a given context. A frame, F, is a refcounted stack activation record deriving from node N for refcounting; this means when constructed, any frame needs a handle H to hold a first ref and a vat V to free the frame when the last ref goes away. Frame subclasses add any members needed to handle parameters, locals, and any other state needed while a jump J uses that frame. Looking at base class members only, a frame is a pair (W, K) where way W is the first entry point using this frame and kont K is the continuation of the caller used for replies. (Returning in continuation passing style, CPS, works by invoking a caller's continuation.) A backtrace is the list of activation records from the topmost frame to the base of the call tree; in text form, this is the sequence of names in the way W of each frame in the list of frames chained through the K continuation. (We need to save the way W of the original entry point when a frame is used in order to mimic normal backtraces in plain old C, because otherwise we won't be able to see what method we're "inside" when return addresses inside a method are completely different jump J methods.) A ret, R, is an optional return value that might be used when returning from a callee to a caller, since a continuation kont K does not include a return value payload. Because a return value might refer to allocated space (for example, inside a callee frame that is returning), each ret R contains a node handle NH to hold a ref to any node that contains part of the value returned. Otherwise a ret R contains a small amount of integer based state, including a bare minimum of metainfo in the form of a type field to denote type of value. Each ret R is a 4-tuple (NH, Is, It, Iv) where Is is an integer status (typically understood to be an errno code), It is an integer type, and Iv is an integer value (if this is enough space—otherwise a node might be necessary). The easiest way to return a large value to a caller is simply to pass a callee frame ref in node handle NH. An arc, A, is the frozen state of a lane fiber continuation. Each arc is a pair (K, R) where kont K is the continuation for what a fiber should do next, and R is a return value (only used if K is a return address inside a method). Because the stack inside kont K is refcounted, and because a large value in return value R is refcounted, an arc A instance can be embedded anywhere. So the arc inside each lane instance is just how an arc is used to represent a fiber. An arc can also be put anywhere else; for example, you can put an arc inside an event in order to represent a lane fiber that should be created and run when an event fires. (In general, you can represent signal handling by saving signalled behavior as arcs that can be put into a newly spawned lane to execute at need.) An elem, E, is a queue element for doubly-linked lists similar to yqe in the list demo (cf «), used for intrinsic linking for all objects typically resident in one and only one queue in the lane fiber runtime. Several objects defined below are subtypes of E so they can be efficiently put into exactly one queue Q at a time (per inherited E base, if more than one occurs). Note definitions below might say type T has an E when in fact T is an E. There's not much point in distinguishing has-a and is-a in this presentation. A queue, Q, is a primitive, double-ended, circular linked list whose members are all elem E instances (or subtypes of E). Note when any elem E is inside a particular queue Q instance, that elem knows which queue it is inside. (The actual E code looks like yqqe which is not now in the list demo.) So queue Q can assert it actually contains an elem in any given operation. Additionally, when queue membership is used to represent object status—for example, a ready lane is inside a specific ready-queue—we can test for that status just by looking at the elem's queue. Note we can use template notation Q<T> to say a queue Q of elems that are all subtypes of T, where T is an E subclass. A zone, Z, is a kitchen sink for all state related to running a lane fiber engine virtual machine, including all lanes that can be allocated, and all other objects of limited population associated with a lane runtime. For example, each zone has a pool of n worker threads to handle blocking system calls. If you like, just think of a zone Z as all the global objects in a lane fiber runtime, but gathered in one place so you can have more than one zone in an address space. Each zone contains at least one vat and one pool, so anything allocated by a lane fiber can be allocated from the lane's zone. 11oct09
«
string-joined cans
lane queues « Let's resume where I stopped yesterday. (See the long lane fibers entry.) Object types not yet defined in detail typically involve one or more queues in lifecycle management. So I might seem to harp this aspect. (Please note I'm actually coding this model, so it's not just study. Of course I don't use these short names; I spell out Lane in full. This version is just easier to reason about, compared to using C++ notation.) To hide template notation, instead of writing Q<T>—for queue of type T—we might instead write TQ to mean "T queue." This shorter notation looks better when we add another letter to denote queue purpose. For example, rTQ for "ready T queue" is easier to read than rQ<T> which looks like symbol salad. So the ready lane queue and wait lane queue in each zone Z are named rLQ and wLQ. An identity, I (also written ID, pronounced eye dee), is a pair of integers (Ii, Ig) where Ii is an index (into an array), and where Ig is a generation (typically pseudo randomly generated). Zero is never a valid generation number, so an ID is invalid when Ig is zero. An ID is also invalid when Ig does not match the current generation number of the entity named by the ID; this is in fact the purpose of generation numbers. Each I identity is 32 bits in size when Ii and Ig are 16 bits apiece; otherwise an ID is 64 bits in size when each sub-part is 32 bits. You should think of an ID as just an integer; a compiler will pass an ID by value just as efficiently as a 32 bit or 64 bit integer. A lane, L, is a 5-tuple (E, I, A, eMQ, iMQ) if you add no other stats or metainfo, like time consumed by a lane. Each lane inherits elem E as a base, so each lane can be an elem in one (and only one) LQ lane queue; so far I needn't put lanes in more than one queue. I is a lane's identity; I is only valid when a lane is alive (spawned and not yet exited). A zone Z can lookup a lane by its ID, and this happens (for example) when a msg M (defined below) is delivered to a lane by ID. A lane's arc A is the frozen state of execution for a fiber in this lane, as defined earlier, consisting of pair (K, R) to hold continuation and optional return value. Holding arc A is a lane's main purpose; the rest is bookkeeping. When a lane exits, each msg in the eMQ exit msg queue is delivered. When a msg is delivered to a lane, it arrives in the iMQ inbox msg queue. (Msg delivery might invoke callback code only, if no lane is addressed by lane ID in a msg.) Both eMQ and iMQ are doubly-linked circular lists of msg M instances in Q<M> format (using the intrinsic E links inside each msg). Listeners place a self-addressed msg M in exit queue eMQ to be delivered when a lane exits. When a lane waits in arrival of async msg M delivery, these arrive in inbox iMQ. There's a lot to say about contracts in msg delivery, with parts appearing above, below (in a definition of msg M), and later, when zone Z is described further. Why is the topic of msg delivery so complex compared to other objects? Because msg M behavior crosses a lot of dynamic execution domains. While this is also true of arc A when a jump J maps an in-arc Ai to an out-arc Ao, in that case it's outside the scope of lane engine control. Arc changes are what app code does when lane fibers execute. In contrast, msg changes are partly how the lane engine runs, with api crossing into scope of app code running in a jump J. For example, when does a msg M in a lane's inbox iMQ actually get delivered to a lane fiber? Only when pop() is called on zone Z inside a J function. Presumably this happens when J is called as the continuation of a wait() when a fiber runs again after blocking; the arrival of a msg when a lane is waiting makes a lane ready, by moving a lane L from the wLQ wait queue to the rLQ ready queue in zone Z. (Don't worry, I'll say this again later.) In that context pop() should succeed in popping a frontmost msg from the current running lane's inbox iMQ, or otherwise the lane should not have unblocked. A callback, Cb (as opposed to cond C), is a pair (Mf, p) where Mf is a function pointer of type void(*Mf)(M*) taking a msg M argument (defined below), and where p is an untyped pointer void* to allow extra state to be passed. This sort of callback is embedded inside msg M to attach executable code to a msg processed in one of two different basic contexts. When a msg M is sent to a thread pool's request queue (a threadsafe MQ) then Cb is called in a worker thread, and presumably p (and/or a node N inside msg M's handle H) indicates state needed by that blocking call. Otherwise, when a msg is delivered from a lane's eMQ exit msg queue by the lane engine's thread (not a worker thread), Cb calls Mf at delivery time if Mf is a nonzero function pointer. In general, callback Cb represents an optional something to be done when a msg M reaches either 1) a worker thread, or 2) a lane's inbox when delivered from another lane's exit msg queue. A msg, M, is an 8-tuple (E, Ik, fid, tid, K, R, Cb, H) if you add no other stats, metainfo, or extra callback methods. (In practice I use another callback besides Cb shown here, but its gnarly explanation suggests it is accidental and not essential complexity. Basically, sometimes you want callback layering, so one callback can call another one.) Each msg M inherits elem E as a base, so each msg can be an elem in one (and only one) MQ msg queue; so far, only one E is enough per msg. Integer Ik denotes the kind of msg, in case more than one type of msg is expected by a recipient. Both fid and tid are instances of I lane ID, denoting from lane and to lane—the sender and receiver of this msg. If tid is valid (with a nonzero Ig generation number) then tid identifies a lane to receive this msg when delivered. You can leave tid invalid if no lane wants to receive this msg; perhaps everything is done by Cb when called in a worker thread, or when some lane exits. Lane fid is either a lane that exited, or a caller of a non-blocking method in a worker thread. Handle H refs an optional node N if this msg needs extra state to work; H might typically be a ref to frame F that sent the msg, since it can hold any state needed. The kont K plus return value R in each msg M together look a bit like an arc A, except they are not used together as an arc instance. Instead they're used at different times; however, one of the uses is turning them into an arc A in case a msg needs to be processed by a new or old lane fiber. When a lane exits, every msg M in that lane's eMQ exit queue is delivered, but first the lane's last return value is copied from a lane's A to this R inside M; so a lane L notifies all listeners with a copy of the lane's last known return value. The kont K in a msg is a good way to encode how a receiving lane fiber should respond to a msg when it arrives in a lane's iMQ inbox. When a lane waits on the inbox and finally gets msg, the K inside might be exactly the continuation the fiber wants to call next, perhaps with the msg tucked into a frame for posterity. Basically, it's hard to effect a "join" operation between a waiting lane and an async operation without using a K continuation. How should you handle waiting for several related async operations? Say you issue N async requests, where each will reply with a msg arriving in a lane's inbox. After starting every request, you wait for a msg to arrive, then pop it from the zone Z when the wait returns. Until all N msg replies arrive, save every msg in the frame of kont K in the lane's arc A. Since every msg popped this way is inside no queue, you can put an MQ msg queue in your frame and store them this way. Once your N-way join is complete, pop all each msg from your frame's MQ and free them (by returning them to the zone Z). More complex forms of async coordination can be done using mutexes and condition variables, which are supported by mutex X and cond C, defined below. These are basically clones of pthread mutexes and condition variables, but have nothing to do with native threads. Instead, calling any mutex X or cond C method amounts to talking with the lane engine scheduler, with a net effect of scheduling your lane to run mutually exclusively with other lane fibers using the same mutex, by blocking a lane attempting to lock a mutex that is already locked. Similarly, condition variables are also cooperative and effected by the lane engine scheduler. (See literature on monitors for background.) A mutex, X, is a 6-tuple (E, Z, S, LQ, lid, Ic) providing cooperative mutual exclusion for lane fibers using one mutex. Fibers are scheduled cooperatively, not pre-emptively, so one fiber cannot interrupt another inside a single jump J. But a fiber context switch can occur between any two jump calls made by the trampoline. When a lane fiber needs exclusive access to some resource across spans of multiple continuation calls, a mutex X is needed to block other lane fibers should they try to use that same resource during the same critical section. While one fiber has a mutex locked, that lane's ID is stored in lid so an explicit representation of a wait-for graph can be analyzed at any time (to see deadlocks, for example). When a mutex X is already locked, any other lane attempting to lock the same mutex will block, moving that lane from zone Z's ready queue rLQ to the LQ lane queue inside X. When a mutex is later unlocked, at least one lane can be popped from LQ and returned to the ready queue. (Perhaps it's best to simply lock the mutex immediately on behalf of the popped lane that becomes unblocked.) Other state inside a mutex X is just bookkeeping. Each mutex inherits E as a base so each mutex can be put into a free list or a used list in zone Z. The used mutex queue uXQ in zone Z is where all currently used mutexes live, so we can later inspect all synchronization primitives when analyzing a wait-for graph. C string name S is a mutex annotation showing a human readable name for a mutex. A pointer to the zone Z containing the lane runtime is also put inside each mutex X to ensure access (and to compare to Z stored in each cond C that uses a mutex). Integer Ic is a count of cond C condition variables currently using the same mutex, helpint detect errors like freeing a mutex when still used by a cond. A cond, C, is a 5-tuple (E, Z, S, LQ, M) providing cooperative control to block lanes waiting for some condition to occur. Each cond must refer to a specific mutex M that must be locked when a cond C is used. (This is just like pthread style api.) Other than M, all state in cond C is very similar to parts of mutex M with the same names: so S is a C string name annotation; E is a baseclass so zone Z can put conds in either a free or used queue; Z is known by pointer to ease access; and LQ is a lane queue containing lanes blocked on this condition variable. Let's revisit zone Z now and finally define some of the internal state of a lane engine runtime. (I'm not going to define how the thread pool works, but it's very similar to what is shown on the thread demo). To avoid saying Z is an N-tuple for a large value of N, maybe we should group parts of Z in sub-objects. Hmmm, maybe later. A zone, Z, is a big object containing all the runtime state of a single lane engine. Each Z is a 14-tuple (L, A, V, P, tMQ, dMQ, fMQ, rLQ, wLQ, fLQ, uXQ, fXQ, uCQ, fCQ) with ten queues consisting of three msg queues: thread, done, and free; three lane queues: ready, wait, and free; two mutex queues: used and free; and two cond queues: used and free. The default vat is V, and the default pool is P, both used for allocation. Parts of zone Z not shown consist mainly of lots of statistics and config state (to specify fixed population sizes and timeslice parameters, etc). When no lane fiber is currently running in the trampoline, both L and A are nil. But inside the trampoline, when a J method is still running, L is the current lane and A is the current arc in that lane. A zone runs only one lane at a time: L is that lane, and A is either the arc inside L or a second alternate on the stack inside the trampoline. (When analyzing a global snapshot of state inside a zone Z, every lane can be examined to see an arc inside that lane, with the sole exception of lane L, which might have its arc in a zone's A instead.) The thread msg queue tMQ is the outbound queue of blocking call requests handled by a worker thread pool. When a request is done, the msg returns as a reply in the done msg queue dMQ. Finally, the free msg queue fMQ is a pool of currently deallocated msg instances. (Currently blocks of consecutive msg instances in an array are allocated in 8K slabs. The same is done for many other object types, like mutexes and conds.) The ready lane queue is the set of lane fibers that can currently be scheduled by the lane engine. None of these fibers are blocked. The wait lane queue is the set of lane fibers blocked until a msg addressed to that lane is delivered, either by reply from the thread pool or by notification from another lane's exit message queue (see eMQ inside lane L). When any msg is put into a lane's iMQ inbox, if that lane is in the zone's wLQ wait queue, it gets moves to the rLQ ready queue, because msg arrival wakes a lane when waiting. In zone Z, the lane engine's trampoline polls the dMQ done msg queue for thread-safe replies from the thread pool, perhaps as often as between each timeslice. A msg with an invalid to lane is simply deallocated; but any msg addressed to a lane with a valid ID is appended to that lane's inbox, causing the lane to wake if it is waiting. Note message delivery has no effect on either mutexes or conds, which alter lane scheduling themselves when a queue in either X or C changes. (Todo: describe the trampoline.) |