diff --git a/notes.md b/notes.md
new file mode 100644
index 0000000..034c98a
--- /dev/null
+++ b/notes.md
@@ -0,0 +1,67 @@
+# Optimizations
+
+Making notes on optimizations I've made and plan to make, so I can remember which ones paid off.
+
+## String interning
+
+Worked well. PUC Lua does this. I think it's faster not because it avoids
+hashing or comparing strings, but because it avoids the pointer deref.
+I still ended up hashing ints after this change.
+
+## Linear search
+
+The n_body benchmark uses tables with about 7 slots in its hot loop.
+The hashing overhead of HashMap for i64 seems pretty bad for this.
+BTreeMap was faster, but not fast enough.
+
+I switched to just an unsorted Vec and linear search, and it's the
+fastest by a small margin.
+
+I don't think PUC Lua does this, but PUC Lua might have a faster, less
+secure hash algorithm than Rust's default.
+
+Flamegraph reveals we still spend a lot of time linear-searching tables.
+
+## Lazy instruction decoding
+
+I think this actually slowed it down. PUC Lua keeps instructions in their
+encoded u32 form and decodes them lazily inside the interpreter's main loop.
+
+I did this mostly to match PUC Lua, although I didn't think it would work. My enum for decoded instructions is only 64 bits, and I didn't think the extra bit fiddling was cheap enough.
+
+Maybe if I tweaked it, it would pay off. It just really doesn't look like it should work.
+
+## Caching the current block
+
+I think this one paid off. The idea was to avoid some `chunk.blocks [i]` derefs and bounds checks in the inner loop.
+
+I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block.
+
+## Caching the current instruction list
+
+I think this one paid off more. Instead of caching the current block I just cached its instructions, since the inner loop doesn't use constants or upvalues much, but every step requires access to the instruction list.
+
+Using `Rc <[u32]>` was fun, too. I never stored a slice directly in a smart pointer before.
+
+## Fat LTO and codegen-units = 1
+
+Did absolutely nothing. I couldn't outsmart LLVM.
+
+## Remove RefCell
+
+(upcoming)
+
+I think the `borrow` and `borrow_mut` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in State directly, replacing `Rc >` with my own ref counting. This might
+remove a layer of indirection, too.
+
+It's a big change, but I'd need _something_ like this for adding a GC anyway, and sometimes big changes have paid off.
+
+## Iterating over instruction list
+
+(upcoming)
+
+I noticed PUC Lua doesn't store a program counter, it stores a `u32 *`, a pointer to the next instruction itself. This might save, like, 1 single cycle or something, I can't believe it does anything, but it could. Because it saves you that "Look at the instruction list, multiply the index by 4, add it to the base pointer" step.
+
+Maybe the real saving is that it saves a little bit of cache space by forgetting the base pointer?
+
+Storing an iterator sounds like a big fight with the borrow checker. I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C. Plus a bounds check.
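
A possible starting point for the "prototype it outside the interpreter" idea in the last section of the notes: a minimal sketch, not the interpreter's actual code. The opcode encoding (`OP_ADD`, `OP_JMP`, `OP_HALT`, `encode`) and the function names are hypothetical and exist only for this toy. It compares an index-based program counter with a `std::slice::Iter` over the same `Rc<[u32]>`, keeping the iterator as a local so it never has to live in a struct next to the `Rc` it borrows from.

```rust
use std::rc::Rc;

// Hypothetical toy encoding (not the real Lua one): the low 8 bits are the
// opcode, the upper 24 bits are an operand.
const OP_ADD: u32 = 0; // acc += operand
const OP_JMP: u32 = 1; // jump to absolute instruction index `operand`
const OP_HALT: u32 = 2; // stop and return acc

fn encode(op: u32, arg: u32) -> u32 {
    op | (arg << 8)
}

// Baseline: the program counter is an index, bounds-checked on every fetch.
fn run_indexed(code: &Rc<[u32]>) -> u32 {
    let mut pc = 0usize;
    let mut acc = 0u32;
    loop {
        let inst = code[pc];
        pc += 1;
        match inst & 0xff {
            OP_ADD => acc = acc.wrapping_add(inst >> 8),
            OP_JMP => pc = (inst >> 8) as usize,
            _ => return acc,
        }
    }
}

// Variant: keep a slice iterator (roughly a "pointer to the next instruction")
// instead of an index. The iterator borrows from `code`, so it stays a local.
fn run_iter(code: &Rc<[u32]>) -> u32 {
    let mut next = code.iter();
    let mut acc = 0u32;
    // Running off the end of the slice ends the loop; this end-of-slice check
    // takes the place of the indexing bounds check.
    while let Some(&inst) = next.next() {
        match inst & 0xff {
            OP_ADD => acc = acc.wrapping_add(inst >> 8),
            // A jump re-slices from the start, so the base is still needed here.
            OP_JMP => next = code[(inst >> 8) as usize..].iter(),
            _ => break,
        }
    }
    acc
}

// Small program: add 5, jump over the "add 100", add 7, halt.
fn demo_program() -> Rc<[u32]> {
    Rc::from(vec![
        encode(OP_ADD, 5),
        encode(OP_JMP, 3),
        encode(OP_ADD, 100), // skipped by the jump
        encode(OP_ADD, 7),
        encode(OP_HALT, 0),
    ])
}
```

Both `run_indexed(&demo_program())` and `run_iter(&demo_program())` return 12. The sketch says nothing about speed; it only shows that the iterator form type-checks when the iterator is scoped to the run loop, and that absolute jumps still need the base slice, so the base pointer is not entirely forgotten.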