📝 doc: document attempted and planned optimizations

2023-10-03 16:17:34 -05:00 · 2023-10-03 16:17:34 -05:00 · 32ddedc066
parent 43a3294c57
commit 32ddedc066
1 changed files with 67 additions and 0 deletions
--- a/notes.md
+++ b/notes.md
@ -0,0 +1,67 @@
+# Optimizations
+
+Making notes on optimizations I've made and plan to make, so I can remember which ones paid off.
+
+## String interning
+
+Worked well. PUC Lua does this too. I think it's faster not because it avoids
+hashing or comparing strings, but because it avoids dereferencing a pointer
+to reach the string. I still ended up hashing ints after this change.
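A minimal sketch of the interning idea, not the interpreter's actual code. `Interner` and `InternedStr` are hypothetical names, and the sketch assumes interned strings are handed out as small integer ids, which matches the note that ints still get hashed afterwards.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical handle for an interned string: just an index.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct InternedStr(u32);

#[derive(Default)]
pub struct Interner {
    // Maps string contents to the id handed out for them.
    ids: HashMap<Rc<str>, InternedStr>,
    // Maps ids back to contents, e.g. for printing.
    strings: Vec<Rc<str>>,
}

impl Interner {
    /// Returns the existing id for `s`, or allocates a new one.
    pub fn intern(&mut self, s: &str) -> InternedStr {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = InternedStr(self.strings.len() as u32);
        let rc: Rc<str> = Rc::from(s);
        self.strings.push(Rc::clone(&rc));
        self.ids.insert(rc, id);
        id
    }

    /// Looks up the contents behind an id.
    pub fn get(&self, id: InternedStr) -> &str {
        &self.strings[id.0 as usize]
    }
}
```

After interning, comparing two keys is an integer comparison, and the string itself is only touched when its contents are actually needed.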
+
+## Linear search
+
+The n_body benchmark uses tables with about 7 slots in its hot loop.
+The hashing overhead of HashMap for i64 seems pretty bad for this.
+BTreeMap was faster, but not fast enough.
+
+I switched to just an unsorted Vec and linear search, and it's the
+fastest by a small margin.
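A rough sketch of what an unsorted Vec with linear search can look like. `SmallTable` is a hypothetical name, and using `f64` as the value type is an assumption to keep the example short; a real Lua table stores arbitrary values.

```rust
/// Hypothetical small table: unsorted key/value pairs, searched linearly.
#[derive(Default)]
pub struct SmallTable {
    entries: Vec<(i64, f64)>,
}

impl SmallTable {
    /// Scans every entry until the key matches. No hashing involved.
    pub fn get(&self, key: i64) -> Option<f64> {
        self.entries
            .iter()
            .find(|(k, _)| *k == key)
            .map(|(_, v)| *v)
    }

    /// Overwrites an existing key, or appends a new pair at the end.
    pub fn set(&mut self, key: i64, value: f64) {
        let found = self.entries.iter().position(|(k, _)| *k == key);
        match found {
            Some(i) => self.entries[i].1 = value,
            None => self.entries.push((key, value)),
        }
    }
}
```

With about 7 slots the whole Vec fits in a couple of cache lines, which is presumably why the scan can compete with a hash lookup at this size.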
+
+I don't think PUC Lua does this, but PUC Lua might have a faster, less
+secure hash algorithm than Rust's default (SipHash).
+
+The flamegraph shows we still spend a lot of time linear-searching tables.
+
+## Lazy instruction decoding
+
+I think this actually slowed it down. PUC Lua keeps instructions in their
+encoded u32 form and decodes them lazily inside the interpreter's main loop.
+
+I did this mostly to match PUC Lua, although I didn't think it would work. My enum for decoded instructions is only 64 bits, so lazy decoding saves just 32 bits per instruction, and I didn't think the extra bit fiddling was cheap enough to be worth that.
+
+Maybe if I tweaked it, it would pay off. It just really doesn't look like it should work.
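For reference, a sketch of what decoding on demand can look like. The notes don't say which Lua version is targeted, so the field layout here is an assumption: it follows the Lua 5.4 `iABC` format (7-bit opcode, 8-bit A, 1-bit k, 8-bit B, 8-bit C). The `Instruction` wrapper is a hypothetical name.

```rust
/// Thin wrapper around an encoded instruction word.
/// Assumes the Lua 5.4 iABC layout; other formats would need other accessors.
#[derive(Clone, Copy)]
pub struct Instruction(pub u32);

impl Instruction {
    /// Bits 0..7: the opcode.
    pub fn opcode(self) -> u8 {
        (self.0 & 0x7f) as u8
    }

    /// Bits 7..15: operand A.
    pub fn a(self) -> u8 {
        ((self.0 >> 7) & 0xff) as u8
    }

    /// Bit 15: the k flag.
    pub fn k(self) -> bool {
        ((self.0 >> 15) & 1) == 1
    }

    /// Bits 16..24: operand B.
    pub fn b(self) -> u8 {
        ((self.0 >> 16) & 0xff) as u8
    }

    /// Bits 24..32: operand C.
    pub fn c(self) -> u8 {
        (self.0 >> 24) as u8
    }
}
```

Each accessor is a shift and a mask, so the cost in question is a few ALU operations per field versus reading a pre-decoded 64-bit enum.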
+
+## Caching the current block
+
+I think this one paid off. The idea was to avoid some `chunk.blocks [i]` derefs and bounds checks in the inner loop.
+
+I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block.
+
+## Caching the current instruction list
+
+I think this one paid off more. Instead of caching the current block, I just cached its instructions. The inner loop doesn't use constants or upvalues much, but every step needs access to the instruction list.
+
+Using `Rc <[u32]>` was fun, too. I had never stored a slice directly in a smart pointer before.
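A small sketch of caching the instruction list in an `Rc<[u32]>`. The `Block` and `Chunk` layouts and `run_block` are hypothetical, and the loop just sums the words as a stand-in for real dispatch.

```rust
use std::rc::Rc;

/// Hypothetical block: instructions are shared so the interpreter can
/// hold on to them without borrowing the chunk.
pub struct Block {
    pub instructions: Rc<[u32]>,
}

pub struct Chunk {
    pub blocks: Vec<Block>,
}

/// Sums the instruction words of one block, standing in for a dispatch loop.
pub fn run_block(chunk: &Chunk, block_idx: usize) -> u64 {
    // One block lookup and one refcount bump when entering the block...
    let instructions: Rc<[u32]> = Rc::clone(&chunk.blocks[block_idx].instructions);
    let mut pc = 0;
    let mut acc = 0u64;
    // ...then each step only indexes the cached slice.
    while pc < instructions.len() {
        acc += u64::from(instructions[pc]);
        pc += 1;
    }
    acc
}
```

A `Vec<u32>` converts into an `Rc<[u32]>` with `Rc::from(vec)` or `.into()`, so the slice can be built once when the chunk is loaded.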
+
+## Fat LTO and codegen-units = 1
+
+Did absolutely nothing. I couldn't outsmart LLVM.
+
+## Remove RefCell
+
+(upcoming)
+
+I think the `borrow` and `borrow_mut` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in State directly, replacing `Rc <RefCell <Table>>` with my own ref counting. This might remove a layer of indirection, too.
+
+It's a big change, but I'd need _something_ like this for adding a GC anyway, and sometimes big changes have paid off.
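One possible shape for this, as a sketch only. `TableId`, `Slot`, and the method names are all hypothetical, and the `Table` contents are simplified. Tables live in a Vec owned by the state, handles are plain indices, and the reference count is a field that gets bumped by hand.

```rust
/// Hypothetical handle: an index into the state's table storage.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TableId(usize);

#[derive(Default)]
pub struct Table {
    pub fields: Vec<(i64, f64)>,
}

struct Slot {
    table: Table,
    ref_count: u32,
}

#[derive(Default)]
pub struct State {
    slots: Vec<Option<Slot>>,
    free: Vec<usize>,
}

impl State {
    /// Allocates a table with one reference, reusing a freed slot if any.
    pub fn new_table(&mut self) -> TableId {
        let slot = Some(Slot { table: Table::default(), ref_count: 1 });
        match self.free.pop() {
            Some(i) => {
                self.slots[i] = slot;
                TableId(i)
            }
            None => {
                self.slots.push(slot);
                TableId(self.slots.len() - 1)
            }
        }
    }

    /// Plain `&mut` access: the borrow checker replaces RefCell's runtime check.
    pub fn table_mut(&mut self, id: TableId) -> &mut Table {
        &mut self.slots[id.0].as_mut().expect("dangling TableId").table
    }

    pub fn incref(&mut self, id: TableId) {
        self.slots[id.0].as_mut().expect("dangling TableId").ref_count += 1;
    }

    /// Drops one reference and frees the slot when the count reaches zero.
    pub fn decref(&mut self, id: TableId) {
        let slot = self.slots[id.0].as_mut().expect("dangling TableId");
        slot.ref_count -= 1;
        if slot.ref_count == 0 {
            self.slots[id.0] = None;
            self.free.push(id.0);
        }
    }

    pub fn live_tables(&self) -> usize {
        self.slots.iter().filter(|s| s.is_some()).count()
    }
}
```

The trade-off is that a stale `TableId` can silently point at a reused slot, which `Rc` rules out by construction. A generation counter per slot is one common way to catch that.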
+
+## Iterating over instruction list
+
+(upcoming)
+
+I noticed PUC Lua doesn't store a program counter. It stores a `u32 *`, a pointer to the next instruction itself. This might save a single cycle or so. I can't believe it does anything, but it could, because it skips the "look at the instruction list, multiply the index by 4, add it to the base pointer" step.
+
+Maybe the real saving is a little bit of cache space from forgetting the base pointer?
+
+Storing an iterator sounds like a big fight with the borrow checker. I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C, plus a bounds check.
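A prototype-sized sketch of the iterator idea, outside any interpreter. The instruction encoding is made up for the example: a word with its low bit set jumps to index `word >> 1`, and any other word adds `word >> 1` to an accumulator. The iterator stays in a local variable borrowing the slice, which sidesteps the hard part. Storing it in the same struct as an owning `Rc <[u32]>` would be self-referential.

```rust
/// Runs a toy instruction stream using a slice iterator as the program counter.
/// Hypothetical encoding: low bit set = jump to index `word >> 1`,
/// otherwise add `word >> 1` to the accumulator.
pub fn run(instructions: &[u32]) -> u64 {
    // `next` plays the role of the pointer to the next instruction.
    let mut next = instructions.iter();
    let mut acc = 0u64;
    while let Some(&word) = next.next() {
        let operand = (word >> 1) as usize;
        if word & 1 == 1 {
            // A jump re-slices from the base, so the base pointer is
            // still needed here, and the range is bounds-checked.
            match instructions.get(operand..) {
                Some(rest) => next = rest.iter(),
                None => break,
            }
        } else {
            acc += operand as u64;
        }
    }
    acc
}
```

Sequential steps never touch an index, but jumps still go through the base slice, so the base pointer is only forgotten if jumps can be expressed relative to the current position.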