Lineiform is my attempt at writing a JIT as a library, so that it can be used by various programming language interpreters in an agnostic way.

JIT stands for just-in-time compiler. Programming languages usually start off as an interpreter; once they reach a critical amount of usage and support (and people complaining about how slow they are), they sometimes adopt JIT compilers. The interpreter side is usually written as a big case/switch over language operation types (Add, If, Break, Function call, etc.), each of which runs a specific piece of code. The JIT side, for code that is run a sufficient number of times, instead abstractly interprets a snippet of code, recording the operations it would have run if it were being interpreted…and emits the equivalent operations as assembly in some compiler-ish way, so that jumping to the start of the assembly is equivalent to running the whole list of operations at once, without the overhead of dispatching to the correct interpreter operation handler at each step.

There are a few problems with this.

For one, JITs are really fucking hard to write. They are the subject of PhD theses, like Maxime Chevalier-Boisvert’s Higgs VM. Very good JITs like Java’s HotSpot or Javascript’s V8 have had possibly millions of man-hours dedicated to making them as fast as possible, and LuaJIT (regarded as one of the best and “simplest” JITs ever made) is so impressive that it’s a running joke that Mike Pall, its author, must be a robot sent back in time from the future, because it’s unbelievable that a normal human being wrote it alone.

Because they are so hard, the second problem falls out: not many languages reach that step. Most languages only ever have an interpreter, because the effort of writing a JIT outweighs what you’d gain over just squeezing some more speed out of the existing interpreter, or spending that time on new features.

And the third problem is that the actual act of writing a JIT is fraught with peril: you have to make sure that the assembly you are emitting is exactly equivalent, from an external point of view, to running the interpreter operations naively. And the process of writing the assembly emission via abstract interpretation usually looks an awful lot like the normal interpreter, but slightly different: you have a giant case/switch over operations, and the operations do something that kinda looks like an If or a Break or a Function call, but defer the actual act until a bit later. It’s tedious work, and because the JIT and interpreter operations are essentially 1:1, it’s more work to keep them in sync whenever you touch interpreter code. But even worse, it’s bespoke. V8 is one of the most complex programs ever written, and Google has poured more money than God into it, but all that effort is almost completely unusable for any other JIT because of how extremely specialized the work is to Javascript and V8 itself.

Lineiform is my attempt to solve this, at least a little bit. The idea is that instead of writing an interpreter and eventually starting nearly from scratch on an equivalent JIT, you write your interpreter in a slightly special way and call some of Lineiform’s library functions at certain points. Lineiform then automatically does JIT-esque things: it optimizes hot code into runs of operations without dispatch overhead, and gets rid of extraneous work that an intelligent JIT compiler would have been able to elide. This is probably the utter height of hubris; the closest equivalent is something like Truffle/Graal or RPython, where you write an interpreter in some subset of Java or Python and, through just normal JITting of that subset, your interpreter gets optimized too: your language semantics are expressed in terms of Java or Python operations, the JIT knows how to emit and optimize assembly for those, and so your language comes along for the ride.

I don’t want to write an interpreter in Java or Python, though, and those come with some annoying constraints (such as Truffle/Graal requiring a commercial license from Oracle, and most people won’t touch Oracle with a 20ft pole… edit: Whoops, that’s a lie - the Community version of Graal is GPL, with an Enterprise version that has extra features). I want to write an interpreter in Rust. And Rust doesn’t have a bytecode interpreter that it targets, so you can’t just write a JIT for that bytecode and get JITting of arbitrary interpreters “for free”. So what do you do?

Closure generation compilers

Well, one piece of the puzzle is something called a “closure generation compiler”. They don’t seem to be very well-known; I was introduced to the concept by a random Facebook engineer’s compiler presentation, and most of the prior art is by a single paper’s author, Marc Feeley. A normal tree-walking interpreter’s evaluate function flows something like: switch on Increment -> [recurse on argument -> switch on Constant(3) -> return 3] -> increment 3, return 4. A closure generation interpreter instead has a build function, which does: switch on Increment -> [recurse on argument -> switch on Constant(3) -> return a closure that, when evaluated, returns 3] -> return a closure that, when evaluated, runs the constant-3 closure and adds 1. And then at the very end, to evaluate instead of build, you call the returned closure, which just runs its argument’s closure and adds 1 to the result. This separates interpretation into two parts: generating a big tree of closures that call other closures for each operation, and actually evaluating, which only runs the closures, jumping directly into each language operation’s runtime behavior. That reduces interpreter overhead, since you no longer pay for the case/switch dispatch on every operation while running the language.
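
Here’s a minimal sketch of those two phases in Rust (the Expr type and build function are my illustration, not Lineiform’s API):

// Hypothetical AST, for illustration only.
enum Expr {
    Constant(i64),
    Increment(Box<Expr>),
}

// Build phase: walk the tree once, returning a closure per node.
fn build(e: &Expr) -> Box<dyn Fn() -> i64> {
    match e {
        Expr::Constant(n) => {
            let n = *n;
            Box::new(move || n)
        }
        Expr::Increment(arg) => {
            let arg = build(arg); // recurse at build time, not eval time
            Box::new(move || arg() + 1)
        }
    }
}

fn main() {
    let prog = Expr::Increment(Box::new(Expr::Constant(3)));
    let compiled = build(&prog); // all the switching happens here...
    assert_eq!(compiled(), 4);   // ...and none of it happens here
}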

I read that and thought it was neat, and then I thought a bit more and went: huh, that sounds a little like a JIT. You are deferring the actual evaluation of language operations until later, and the evaluation does less work because you were able to do some things ahead of time (specializing the runtime behavior into a closure you can just call). And the way closures work in Rust is that they are essentially structs: they have a function pointer you jump to in order to run them, and a closed-over environment, some list of values in memory that is passed to the function pointer as an argument. This is why let three = || 3; (|| three() + 1)() works: three is put in the outer closure’s environment, then fetched out and called when it is run.
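
You can model that layout by hand (this is an illustration of the shape, not what rustc literally emits):

// Roughly what a closure looks like at runtime: some environment in
// memory, plus a code pointer that takes that environment as an argument.
struct Closure {
    env: i64,                  // the captured value
    code: fn(&Closure) -> i64, // the compiled body
}

fn add_one_body(c: &Closure) -> i64 {
    c.env + 1
}

fn main() {
    let add_one = Closure { env: 3, code: add_one_body };
    // Calling it means jumping to the code pointer with the env as argument.
    assert_eq!((add_one.code)(&add_one), 4);
}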

Forth and direct threading

Another piece of the puzzle is the programming language Forth. Forth is kinda neat, in that it has an interpreter design that isn’t used very often outside of it: a direct threaded interpreter. Instead of a big case/switch over operations, where you jump to the implementation of an operation and then jump back to the case/switch for the next one, you build a big table of function pointers, each one its own operation; critically, each operation ends with a jump to the next entry in the table. Running the interpreter is then just jumping to the first entry in the table, which runs its body and then jumps to the next entry, which runs its body, which…etc. Forth is a stack-based language, where operations push values to the stack or pop them to use as inputs; constants, for example, are just snippets that push the constant value. 3 5 + is then a program that jumps to a snippet that pushes 3 to the stack and runs the next operation, which is a snippet that pushes 5 to the stack and runs the next operation, which pops two values (3 and 5) from the stack, adds them, and returns.
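
A toy version of direct threading in Rust might look like this (a real Forth threads with actual jumps rather than nested calls, and the Op/next names are mine):

// A table of operation pointers; each op does its work and then
// dispatches straight to the next entry, with no central switch.
#[derive(Clone, Copy)]
struct Op(fn(&[Op], usize, &mut Vec<i64>));

fn next(ops: &[Op], pc: usize, stack: &mut Vec<i64>) {
    if pc < ops.len() {
        (ops[pc].0)(ops, pc + 1, stack); // "jump" to the next snippet
    }
}

fn push3(ops: &[Op], pc: usize, stack: &mut Vec<i64>) {
    stack.push(3);
    next(ops, pc, stack);
}

fn push5(ops: &[Op], pc: usize, stack: &mut Vec<i64>) {
    stack.push(5);
    next(ops, pc, stack);
}

fn add(ops: &[Op], pc: usize, stack: &mut Vec<i64>) {
    let b = stack.pop().unwrap();
    let a = stack.pop().unwrap();
    stack.push(a + b);
    next(ops, pc, stack);
}

fn main() {
    let program = [Op(push3), Op(push5), Op(add)]; // "3 5 +"
    let mut stack = Vec::new();
    next(&program, 0, &mut stack);
    assert_eq!(stack, vec![8]);
}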

There’s a further insight, however: you can compile direct threaded functions! If you have the function 3 5 +, you can just stitch all the snippets together into one big function, with no tailcalls at the end of each operation. If you have the list of operations, then the sequence of instructions it will end up running is by definition all of them sequentially, so you can emit and run that directly. And if you have an optimizing compiler in front, maybe it notices while emitting code that there’s no reason to push 3 and then pop 3 from the stack into a register for the add later, when it can just put the constant 3 in a register directly - the same for 5 - and then maybe it notices that it can fold 3+5 to 8 and emit just that. Because you had extra knowledge, namely the full list of operations that the function would run, you could do global optimizations and get rid of some of them.
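
In other words, “compiling” the threaded 3 5 + program just means pasting the three op bodies together in order and dropping the dispatch between them - something like:

// The three op bodies from the threaded program, concatenated.
fn compiled(stack: &mut Vec<i64>) {
    stack.push(3);                // push3's body
    stack.push(5);                // push5's body
    let b = stack.pop().unwrap(); // add's body
    let a = stack.pop().unwrap();
    stack.push(a + b);
}

fn main() {
    let mut stack = Vec::new();
    compiled(&mut stack);
    assert_eq!(stack, vec![8]);
    // An optimizer that sees the whole body at once can cancel the
    // push/pop pairs and fold 3+5, leaving just `stack.push(8)`.
}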

This is one of the key insights that interpreters are missing, and why they are slow in a lot of cases: if you have a sequence of language operations that, say, add values together, those operations might have to check whether the value being added is an int, or a float, or an object with an overloaded + operator, and so have to do work at evaluation time to figure that out. If you have a concrete operation, such as 3 5 +, then you know what work is unneeded: your values are always numbers, so you can specialize the add to only handle integers, and maybe even compile the whole thing down to a constant 8. Language primitives have to be maximally permissive, because they have to cover all possible inputs; if you know the inputs, even only some of them, then you can shrink the range of cases they need to cover, and do global optimizations to remove redundant work. You can think of this as a special case of partial application: you are fixing the function pointer that the tailcall at the end of each snippet will run to a concrete value, and so can apply it at compile time instead of at runtime.
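
As a sketch of what specialization buys you (the Value type here is hypothetical, not from any real interpreter):

// A hypothetical dynamic value type, for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
}

// The language primitive: maximally permissive, so every call pays
// for tag checks on both operands.
fn add_generic(a: Value, b: Value) -> Value {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Value::Int(x + y),
        (Value::Float(x), Value::Float(y)) => Value::Float(x + y),
        (Value::Int(x), Value::Float(y)) => Value::Float(x as f64 + y),
        (Value::Float(x), Value::Int(y)) => Value::Float(x + y as f64),
    }
}

// The same op specialized for "3 5 +": the inputs are known ints, so
// the tag checks vanish - and since both are constants, so does the add.
fn add_3_5_specialized() -> Value {
    Value::Int(8)
}

fn main() {
    assert_eq!(add_generic(Value::Int(3), Value::Int(5)), add_3_5_specialized());
}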

Putting them together

Let’s return to Rust closures. When you run the closure generation interpreter, you are taking some dynamic input and returning a tree of closures that hold the other closures they call in their environments. If you have the Rust closure let three = || 3; let five = || 5; let add = || three() + five(); add(), then add is one closure with two entries in its environment (three and five); it jumps to their implementations, which return values. If you squint, this is very similar to how a direct threaded interpreter works, except the calls happen in the middle of the function instead of at the end, and so leave their results as values inside the function instead of on a global stack. You still have concrete implementations of snippets of code you know you will run if you call the closure; it’s just harder to figure out what they are, since it’s not a linear array. In fact, in order to figure out which closures the add closure would call, you pretty much have to just run it, since a closure body can be arbitrary x86 instructions, not some list of interpreter operations that you could special-case on how to stitch together and optimize. If you aren’t crazy, this is about when you start thinking about how rustc emits LLVM IR and how you can maybe leverage that, and invent holyjit. Unfortunately, I’m crazy, and don’t want to do things like vendor a custom rustc fork to build programs with.

Well, it turns out there is maybe hope. There is a concept called a dynamic recompilation engine. QEMU uses this, for example: when you run an aarch64 VM on an x86 host, behind the scenes QEMU is translating those ARM instructions into an internal representation that keeps the same operational semantics as the ARM program, and then emitting x86 that, from the outside, has the same runtime behavior. QEMU is JIT compiling arbitrary code from ARM to x86. And this isn’t restricted to QEMU; things like DynamoRIO use dynamic recompilation to JIT from x86 to x86, so that you can do neat things like add code before and after any branch instruction in a binary: it lifts x86 instructions to an IR, adds whatever operation you want done before branches, and then lowers the IR back to x86, making sure to fix up memory accesses so that nothing can tell the code was moved and modified.
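
The lift -> optimize -> lower pipeline can be modeled with a toy (this has nothing to do with QEMU’s actual TCG internals; the IR and names are invented for illustration):

// A made-up IR for a stack of values.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ir {
    Const(i64), // produce an immediate
    Add,        // add the two most recent values
}

// "Lift": decode source instructions into the IR (already decoded here).
fn lift(input: &[Ir]) -> Vec<Ir> {
    input.to_vec()
}

// "Optimize": a peephole pass that folds Const, Const, Add into one Const.
fn optimize(ir: &[Ir]) -> Vec<Ir> {
    let mut out: Vec<Ir> = Vec::new();
    for op in ir {
        match (out.len(), op) {
            (n, Ir::Add) if n >= 2 => {
                if let (Ir::Const(a), Ir::Const(b)) = (out[n - 2], out[n - 1]) {
                    out.truncate(n - 2);
                    out.push(Ir::Const(a + b)); // constant fold
                } else {
                    out.push(Ir::Add);
                }
            }
            _ => out.push(*op),
        }
    }
    out
}

// "Lower": emit target instructions; strings stand in for real encoding.
fn lower(ir: &[Ir]) -> Vec<String> {
    ir.iter()
        .map(|op| match op {
            Ir::Const(n) => format!("mov eax, {:#x}", n),
            Ir::Add => "add eax, ebx".to_string(),
        })
        .collect()
}

fn main() {
    let lifted = lift(&[Ir::Const(3), Ir::Const(5), Ir::Add]);
    let optimized = optimize(&lifted);
    assert_eq!(optimized, vec![Ir::Const(8)]);
    println!("{:?}", lower(&optimized)); // ["mov eax, 0x8"]
}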

And now that we have the background, we have Lineiform: Lineiform is just a wrapper function that takes a closure as input and returns a closure as output. In the middle, however, it dynamically inlines the calls to other closures that the input might make, using dynamic recompilation to lift, optimize, and then lower x86 instructions into a set of x86 instructions with the same operational semantics but hopefully less work: inlining small closure snippets so there’s no call/ret and ABI overhead, or folding constants like 3+5. We can have a performant closure generation interpreter, which is already plenty fast, and hopefully optimize away whatever work we can from hot closures, such as commonly called methods or loops, to avoid things like type checks - incrementally JITting code faster and faster. And critically, this should all be very easy to integrate with an existing codebase or an arbitrary new interpreter, since it doesn’t depend on any interpreter-specific concepts or require re-implementing your interpreter as an entirely new JIT like existing projects do: you only have the one interpreter case/switch dispatch, which builds a reified tree of snippets that jump between each other, which Lineiform collapses down into a single code block and optimizes.
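
Judging from the usage example further down, the shape of that wrapper is roughly this (my guess at the signature, not Lineiform’s documented API):

pub struct Lineiform;

impl Lineiform {
    // Guessed shape, inferred from the example below; sketched here as a
    // pass-through. The real thing would lift the closure's code, inline
    // its callees, optimize, and emit new code to jump to instead.
    pub fn speedup<A, O, F: Fn(A) -> O>(&mut self, f: F) -> impl Fn(A) -> O {
        move |a| f(a)
    }
}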

One of the inspirations for this project was stumbling upon, years ago, a cool GitHub repo called redmagic that is kind of similar: it’s a C library that you annotate a normal interpreter with, and it automatically builds program traces for a tracing JIT by stitching together the assembly instructions that single-stepping would hit. In practice that’s very different from how Lineiform works, but it’s really cool and I’ve been thinking about it on and off ever since. It’s unfortunately also not very useful (quoted performance gains were ~10%, in very specific cases, and development has stalled); I’m hopeful that being a lot more structured will help Lineiform do a lot better.

And there are lots of cases where this could be useful. A hot interpreter loop that indexes into an array, for example, would just be calling an Index operation that has to bounds check every time, because it has no way of knowing across interpreter operations that a check was already done; likewise, calling a method on some dynamic object multiple times means re-checking at each invocation whether the method exists on the object. With Lineiform, the hope is that the multiple Index operations are exposed to a single-function optimizer, which can then remove the redundant bounds checks. And the pie-in-the-sky goal is to do some really tricky stuff with persistent caching of the IR in order to extract an ahead-of-time compiler for interpreters too, so that you only have to pay for the dynamic recompilation and optimization once, ever, for a program.
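
As a concrete sketch of the Index case (illustrative Rust, not interpreter code):

fn main() {
    let data = std::rc::Rc::new(vec![10i64, 20, 30]);

    // Two "Index" ops built by a closure generation interpreter; each
    // must bounds check on its own, since neither can see the other.
    let (a, b) = (data.clone(), data.clone());
    let index_0 = move || a[0]; // bounds check #1
    let index_1 = move || b[1]; // bounds check #2
    let add = move || index_0() + index_1();

    // Once the three bodies are inlined into one function, an optimizer
    // can see that a single length check covers both accesses.
    assert_eq!(add(), 30);
}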

The goal is to provide a modest speedup for whatever interpreter is able or willing to use Lineiform, at a fraction of the effort of writing a normal JIT. It’s not going to be as performant as an artisanal hand-rolled JIT…but hopefully it will be close, and nearly free.

So far it doesn’t do much: it can inline some simple closures and fold simple math operations across them, such as:

use core::hint::black_box;
use std::num::Wrapping;

let a: Wrapping<usize> = Wrapping(1);
let mut jit = crate::Lineiform::new();
// black_box keeps rustc from folding the constant away at Rust compile
// time, so any folding visible in the output below is Lineiform's doing
let clos = jit.speedup(move |()| { black_box(a) + Wrapping(2) });

optimizing down to

push rbp
mov rbp, rsp
mov eax, 0x3
mov rsp, rbp
pop rbp
ret

Forward from here

I was trying to use Cranelift, a small and fast codegen backend mostly used by Fastly for their WebAssembly JIT, but the use case I’m targeting is sufficiently weird that it didn’t work all that well. I’ve been working on an alternate codegen backend, hand-rolled specifically for Lineiform’s constraints, that I’m fairly hopeful will work out OK - the scope is a lot more limited than a normal compiler’s in various ways, so it’s not like I’m going to have to reimplement all of LLVM. I’ll probably implement a simple closure generation Scheme interpreter, and then implement operations as needed until I have a PoC; I don’t know how long this will take, but it’s probably a while. In the meantime, if you’re interested, you can check out the (extremely work-in-progress) GitHub repo.

And if it works, and if it’s fast, and if it’s useful? Who knows. I’d like it if Apple or Facebook paid me a lot of money for this library in order to make things fast (think being able to JIT compile build scripts or DSLs), so I’m not sure whether I’ll open-source it or only make it source-available, just in case. If you’re the type of person who cares about licensing (or just wants to chat about compilers or systems programming!), reach out to me on Twitter.