lunar_wave/notes.md

# Optimizations

Making notes on optimizations I've made and plan to make, so I can remember which ones paid off.

## String interning

Worked well. PUC Lua does this. I think it's faster not because it avoids hashing or comparing strings, but because it avoids the pointer deref. I still ended up hashing ints after this change.
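A minimal sketch of the idea, not lunar_wave's actual API: each distinct string is stored once, and equal strings share the same `Rc`, so an equality check can start with a cheap pointer comparison.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Minimal string interner (hypothetical names, not the real implementation).
#[derive(Default)]
pub struct Interner {
    set: HashSet<Rc<str>>,
}

impl Interner {
    /// Return the canonical Rc for `s`, inserting it on first sight.
    /// `HashSet::get` can look up by `&str` because `Rc<str>: Borrow<str>`.
    pub fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.set.get(s) {
            return existing.clone();
        }
        let rc: Rc<str> = Rc::from(s);
        self.set.insert(rc.clone());
        rc
    }
}
```

With this shape, `Rc::ptr_eq` is enough to compare two interned strings.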

The n_body benchmark uses tables with about 7 slots in its hot loop. The hashing overhead of `HashMap` for `i64` keys seems pretty bad at that size. `BTreeMap` was faster, but not fast enough.

I switched to just an unsorted `Vec` with linear search, and it's the fastest by a small margin.
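A sketch of the `Vec`-backed table, with a toy `Value` type standing in for the real one: get and set are both a linear scan, which beats hashing when the table stays this small.

```rust
/// Toy stand-in for the interpreter's real value type.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Nil,
    Number(f64),
}

/// Integer-keyed table as an unsorted Vec with linear search.
#[derive(Default)]
pub struct IntTable {
    slots: Vec<(i64, Value)>,
}

impl IntTable {
    /// Linear scan; missing keys read as Nil, like in Lua.
    pub fn get(&self, key: i64) -> Value {
        self.slots
            .iter()
            .find(|(k, _)| *k == key)
            .map(|(_, v)| v.clone())
            .unwrap_or(Value::Nil)
    }

    /// Overwrite an existing slot or append a new one.
    pub fn set(&mut self, key: i64, value: Value) {
        if let Some(slot) = self.slots.iter_mut().find(|(k, _)| *k == key) {
            slot.1 = value;
        } else {
            self.slots.push((key, value));
        }
    }
}
```

At ~7 slots the whole table fits in one or two cache lines, which is presumably why the scan wins.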

I don't think PUC Lua does this, but it might get away with hashing because it uses a faster, less collision-resistant hash than Rust's default (SipHash).

Flamegraph reveals we still spend a lot of time in linear searching tables.

## Lazy instruction decoding

I think this actually slowed it down. PUC Lua keeps instructions in their encoded u32 form and decodes them lazily inside the interpreter's main loop.

I did this mostly to match PUC Lua, even though I didn't expect it to work: my enum for decoded instructions is only 64 bits, so the extra bit fiddling on every step didn't look cheap enough to win.

Maybe it would pay off with more tweaking, but it really doesn't look like it should.
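For reference, a sketch of what decode-in-the-loop looks like. The field layout here is made up (a 6-bit opcode and two 8-bit operands), not Lua's real encoding: each accessor is a shift and a mask applied at execution time instead of up front.

```rust
/// Hypothetical packed layout: op in bits 0..6, A in 6..14, B in 14..22.
#[inline]
fn opcode(insn: u32) -> u32 {
    insn & 0x3f
}

#[inline]
fn arg_a(insn: u32) -> u32 {
    (insn >> 6) & 0xff
}

#[inline]
fn arg_b(insn: u32) -> u32 {
    (insn >> 14) & 0xff
}

/// Pack an instruction; the inverse of the accessors above.
fn encode(op: u32, a: u32, b: u32) -> u32 {
    (op & 0x3f) | ((a & 0xff) << 6) | ((b & 0xff) << 14)
}
```

The interpreter loop would call the accessors on the raw `u32` it just fetched, rather than matching on a pre-decoded enum.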

## Caching the current block

I think this one paid off. The idea was to avoid some `chunk.blocks[i]` derefs and bounds checks in the inner loop.

I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block.
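A sketch of the hoisting, with hypothetical `Chunk`/`Block` types: the `Rc` is cloned once before the loop, so the indexed access (and its bounds check) happens once instead of on every step.

```rust
use std::rc::Rc;

/// Hypothetical types, roughly the shape described above.
pub struct Block {
    pub instructions: Vec<u32>,
}

pub struct Chunk {
    pub blocks: Vec<Rc<Block>>,
}

/// Stand-in for the interpreter loop: one `chunk.blocks[i]` access up
/// front, then every iteration goes straight through the cached Rc.
pub fn sum_opcodes(chunk: &Chunk, block_index: usize) -> u32 {
    let block: Rc<Block> = chunk.blocks[block_index].clone();
    let mut acc: u32 = 0;
    for &insn in &block.instructions {
        acc = acc.wrapping_add(insn & 0x3f); // pretend "execute"
    }
    acc
}
```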

## Caching the current instruction list

I think this one paid off more. Instead of caching the current block, I cached just its instructions: the inner loop rarely touches constants or upvalues, but every step needs the instruction list.

Using `Rc<[u32]>` was fun, too. I'd never stored a slice directly behind a smart pointer before.
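A tiny sketch of the `Rc<[u32]>` trick: `Rc` implements `From<Vec<T>>`, so the slice data ends up in a single allocation right behind the refcounts, with no intermediate `Vec` indirection.

```rust
use std::rc::Rc;

/// Convert an owned instruction list into a shareable slice.
/// `Rc::from(vec)` copies the elements into one allocation that holds
/// the refcounts plus the slice data itself.
fn share_instructions(insns: Vec<u32>) -> Rc<[u32]> {
    Rc::from(insns)
}
```

Cloning the resulting `Rc<[u32]>` is just a refcount bump; both handles point at the same instructions.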

## Fat LTO and `codegen-units = 1`

Did absolutely nothing. I couldn't outsmart LLVM.
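For reference, these are the standard Cargo release-profile knobs; presumably the `Cargo.toml` change looked something like this:

```toml
[profile.release]
lto = "fat"          # whole-program LTO across all crates
codegen-units = 1    # single codegen unit, more room for inlining
```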

## Remove `RefCell`

Result: Worked well. Dropped from about 3,700 to 3,200 samples. The code got even uglier.

Plan:

I think the `borrow()` and `borrow_mut()` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in `State` directly, replacing `Rc<RefCell<Table>>` with my own reference counting. This might remove a layer of indirection, too.

It's a big change, but I'd need something like this for adding a GC anyway, and sometimes big changes have paid off.
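A sketch of the shape this could take, with hypothetical names and a toy table payload: `State` owns every table, handles are plain indices, and the refcount is a field the interpreter bumps manually. The field access paths are ordinary `&`/`&mut` borrows with no `RefCell` bookkeeping, which is also roughly the object model a GC would need.

```rust
/// Toy table payload; the real one would hold Lua values.
#[derive(Default)]
pub struct Table {
    pub slots: Vec<(i64, f64)>,
}

struct TableSlot {
    refcount: u32,
    table: Table,
}

/// All tables live in State; handles are indices, not pointers.
#[derive(Default)]
pub struct State {
    tables: Vec<TableSlot>,
}

#[derive(Clone, Copy)]
pub struct TableRef(usize);

impl State {
    pub fn new_table(&mut self) -> TableRef {
        self.tables.push(TableSlot { refcount: 1, table: Table::default() });
        TableRef(self.tables.len() - 1)
    }

    /// OP_SETFIELD path: a plain &mut access, no borrow_mut().
    pub fn set_field(&mut self, t: TableRef, key: i64, value: f64) {
        let table = &mut self.tables[t.0].table;
        if let Some(slot) = table.slots.iter_mut().find(|(k, _)| *k == key) {
            slot.1 = value;
        } else {
            table.slots.push((key, value));
        }
    }

    /// OP_GETFIELD path: a plain & access, no borrow().
    pub fn get_field(&self, t: TableRef, key: i64) -> Option<f64> {
        self.tables[t.0]
            .table
            .slots
            .iter()
            .find(|(k, _)| *k == key)
            .map(|(_, v)| *v)
    }

    pub fn retain(&mut self, t: TableRef) {
        self.tables[t.0].refcount += 1;
    }

    pub fn release(&mut self, t: TableRef) {
        self.tables[t.0].refcount -= 1;
        // A real implementation would free or reuse the slot at zero.
    }
}
```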

## Cache constants

Result: Regressed from 3,200 to 3,600 samples. Not sure why.

Plan: OP_GETFIELD hits the constants a lot. I thought caching the constants list, instead of dereferencing the chunk and block constantly, might help.

## Splitting up the opcode match

Result: No change, 3,200 to 3,200 samples. Maybe Rust was already optimizing this into an optimal jump table?

Plan: Maybe if the hot inner opcodes (OP_GETFIELD, OP_MUL, and OP_SETFIELD) get their own `match` at the top of the function, the step function can exit sooner.
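A sketch of the split dispatch, with made-up opcode values: a small `match` over the hot opcodes runs first and returns early, falling through to the full decoder for everything else.

```rust
/// Hypothetical opcode numbering, not lunar_wave's real one.
const OP_GETFIELD: u32 = 0;
const OP_SETFIELD: u32 = 1;
const OP_MUL: u32 = 2;

/// Stand-in for the step function; returns a label instead of executing.
fn step(op: u32) -> &'static str {
    // Hot path: the three opcodes that dominate the n_body inner loop.
    match op {
        OP_GETFIELD => return "getfield",
        OP_SETFIELD => return "setfield",
        OP_MUL => return "mul",
        _ => {}
    }
    // Cold path: the full opcode table would live here.
    match op {
        3 => "add",
        _ => "unknown",
    }
}
```

Whether this beats a single flat `match` depends entirely on what the compiler was already generating, which matches the "no change" result above.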

## Cache constants, again

Result: Regressed again, from 2,900 to 3,700 samples. It's not meant to be.

## Iterating over the instruction list

(upcoming)

I noticed PUC Lua doesn't store a program counter; it stores a `u32 *`, a pointer to the next instruction itself. This might save, like, one single cycle per step. I can't believe it does anything, but it could, because it skips the "look at the instruction list, multiply the index by 4, add it to the base pointer" step.

Maybe the real saving is that it saves a little bit of cache space by forgetting the base pointer?

Storing an iterator sounds like a big fight with the borrow checker, so I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C, plus a bounds check.
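The standalone version of the prototype is easy; the fight only starts when the iterator has to live inside the interpreter struct. A sketch of the iterator-as-pc idea: `iter.next()` compiles down to "compare against end pointer, load, bump pointer", which is close to PUC Lua's `u32 *` scheme.

```rust
/// Stand-in interpreter loop: the slice iterator *is* the program
/// counter. No index arithmetic, just a pointer walk to the end.
fn run(insns: &[u32]) -> u64 {
    let mut acc: u64 = 0;
    let mut iter = insns.iter();
    while let Some(&insn) = iter.next() {
        // Stand-in for decode + execute.
        acc = acc.wrapping_add(u64::from(insn));
    }
    acc
}
```

Note there's no per-access bounds check here at all, only the end-of-slice comparison that any loop needs; the borrow-checker cost is in storing `iter` across steps, not in writing this loop.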

## Threaded interpretation of basic blocks

(upcoming)

Plan: I don't want to do this, because it's against the spirit of matching what PUC Lua does. And I would need to set up a microbenchmark to prove that it would have any chance of paying off. And it's a little bit over-fitting to the n_body benchmark, whose inner loop is heavy on number crunching.

But it's inspired by QEMU's TCG, so I think it could work.

There are two places in the n_body inner loop with a run of five math instructions (three muls and two adds) that compute the squared length of a vector. If, when the block is first loaded, we could detect this as a block of only non-branching math instructions, we could replace it with specialized instructions for a non-Lua-compatible interpreter. When we hit that block, we make sure all three input registers are floats, then execute the alternate instructions using special float-only registers, possibly in a threaded interpreter mode. When those five instructions are done, we either copy the float registers out to the Lua value registers, or the last couple of instructions write out Lua values instead of floats.

This might reduce the stepping overhead, allow a simpler decode step (there would be no instructions for tables or call/return), and let us work on bare 8-byte floats while skipping a couple of type checks inside the block.

The cost: it adds a ton of complexity, it's a new way to fail, and if the biggest block I can find is 5 ops, it may never pay for itself.
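The detection pass is the cheap part to prototype. A sketch, with made-up opcode values: scan a block for maximal runs of "pure math" instructions long enough to be worth specializing.

```rust
/// Hypothetical opcode numbering; in the real encoding the opcode
/// would come out of the instruction's low bits, as assumed here.
const OP_MUL: u32 = 2;
const OP_ADD: u32 = 3;

fn is_pure_math(insn: u32) -> bool {
    matches!(insn & 0x3f, OP_MUL | OP_ADD)
}

/// Return (start, len) for each run of pure-math instructions with
/// length >= min_len. These runs are the candidates for replacement
/// with specialized float-only instructions.
fn find_math_runs(insns: &[u32], min_len: usize) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &insn) in insns.iter().enumerate() {
        match (is_pure_math(insn), start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                if i - s >= min_len {
                    runs.push((s, i - s));
                }
                start = None;
            }
            _ => {}
        }
    }
    // A run can end at the end of the block.
    if let Some(s) = start {
        if insns.len() - s >= min_len {
            runs.push((s, insns.len() - s));
        }
    }
    runs
}
```

Running this over real chunks would also answer the "is 5 ops the biggest block" question before committing to the interpreter changes.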