📝 doc: document attempted and planned optimizations

main
_ 2023-10-03 16:17:34 -05:00
parent 43a3294c57
commit 32ddedc066
1 changed files with 67 additions and 0 deletions

notes.md Normal file

@ -0,0 +1,67 @@
# Optimizations
Making notes on optimizations I've made and plan to make, so I can remember which ones paid off.
## String interning
Worked well. PUC Lua does this. I think it's faster not because it avoids
hashing or comparing strings, but because it avoids the pointer deref.
I still ended up hashing ints after this change.
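A minimal sketch of the interning idea, not the actual code from this project. `Interner` and `InternedString` are made-up names. Each distinct string gets a small integer ID, so later table lookups compare and hash an integer instead of following a pointer to the string bytes.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical handle for an interned string: just an index.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct InternedString(u32);

/// Hypothetical interner: maps each distinct string to one ID.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<Rc<str>, InternedString>,
    strings: Vec<Rc<str>>,
}

impl Interner {
    /// Returns the existing ID for `s`, or assigns the next free one.
    pub fn intern(&mut self, s: &str) -> InternedString {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = InternedString(self.strings.len() as u32);
        let rc: Rc<str> = Rc::from(s);
        self.strings.push(Rc::clone(&rc));
        self.ids.insert(rc, id);
        id
    }

    /// Looks the text back up, e.g. for printing.
    pub fn get(&self, id: InternedString) -> &str {
        &self.strings[id.0 as usize]
    }
}
```

The string is only hashed once, at intern time. After that, equality is a single integer compare.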
## Linear search
The n_body benchmark uses tables with about 7 slots in its hot loop.
The hashing overhead of HashMap for i64 seems pretty bad for this.
BTreeMap was faster, but not fast enough.
I switched to just an unsorted Vec and linear search, and it's the
fastest by a small margin.
I don't think PUC Lua does this, but PUC Lua might have a faster, less
secure hash algorithm than Rust's default.
The flamegraph shows we still spend a lot of time linearly searching tables.
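A rough sketch of the unsorted-Vec approach, assuming `i64` keys (interned string IDs, say) and `f64` values for simplicity. `SmallTable` is a hypothetical name, and the real table type certainly holds more than floats.

```rust
/// Hypothetical small table: unsorted key/value pairs, searched linearly.
#[derive(Default)]
pub struct SmallTable {
    entries: Vec<(i64, f64)>,
}

impl SmallTable {
    /// Scans every entry until the key matches. No hashing at all.
    pub fn get(&self, key: i64) -> Option<f64> {
        self.entries
            .iter()
            .find(|entry| entry.0 == key)
            .map(|entry| entry.1)
    }

    /// Overwrites an existing key, or appends a new pair at the end.
    pub fn set(&mut self, key: i64, value: f64) {
        match self.entries.iter_mut().find(|entry| entry.0 == key) {
            Some(entry) => entry.1 = value,
            None => self.entries.push((key, value)),
        }
    }
}
```

With about 7 entries the whole Vec is 112 bytes, so a scan touches very little memory. The cost grows linearly with table size, so this only makes sense for small tables.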
## Lazy instruction decoding
I think this actually slowed it down. PUC Lua keeps instructions in their
encoded u32 form and decodes them lazily inside the interpreter's main loop.
I did this mostly to match PUC Lua, although I didn't think it would work. My enum for decoded instructions is only 64 bits, and I didn't think the extra bit fiddling was cheap enough.
Maybe if I tweaked it, it would pay off. It just really doesn't look like it should work.
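For reference, lazy decoding looks roughly like this. The bit layout, the opcode numbers, and the function names below are all made up for the sketch. They are not PUC Lua's real encoding and not this project's either.

```rust
// Hypothetical layout: bits 0..7 opcode, 7..15 A, 16..24 B, 24..32 C.
pub const OP_MOVE: u32 = 0;
pub const OP_ADD: u32 = 1;

/// Packs one instruction into a u32 using the made-up layout above.
pub fn encode(op: u32, a: u8, b: u8, c: u8) -> u32 {
    (op & 0x7f) | (a as u32) << 7 | (b as u32) << 16 | (c as u32) << 24
}

// Field accessors: this is the "bit fiddling" paid on every step.
#[inline]
pub fn opcode(i: u32) -> u32 { i & 0x7f }
#[inline]
pub fn arg_a(i: u32) -> usize { ((i >> 7) & 0xff) as usize }
#[inline]
pub fn arg_b(i: u32) -> usize { ((i >> 16) & 0xff) as usize }
#[inline]
pub fn arg_c(i: u32) -> usize { (i >> 24) as usize }

/// Main loop: instructions stay encoded and are decoded as they run.
pub fn run(code: &[u32], regs: &mut [i64]) {
    for &i in code {
        match opcode(i) {
            OP_MOVE => regs[arg_a(i)] = regs[arg_b(i)],
            OP_ADD => regs[arg_a(i)] = regs[arg_b(i)] + regs[arg_c(i)],
            other => panic!("unknown opcode {other}"),
        }
    }
}
```

The trade is 4 bytes per instruction plus a few shifts and masks per step, versus 8 bytes per instruction for a pre-decoded enum that can be matched on directly.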
## Caching the current block
I think this one paid off. The idea was to avoid some `chunk.blocks [i]` derefs and bounds checks in the inner loop.
I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block.
## Caching the current instruction list
I think this one paid off more. Instead of caching the current block I just cached its instructions, since the inner loop doesn't use constants or upvalues much, but every step requires access to the instruction list.
Using `Rc <[u32]>` was fun, too. I never stored a slice directly in a smart pointer before.
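A sketch of the caching idea, with assumed type and field names (`Block`, `Chunk`, `State`, `current_instructions`). The real structs surely differ. The point is that the inner loop reads from a cached `Rc<[u32]>` and only touches `chunk.blocks` when it switches blocks.

```rust
use std::rc::Rc;

/// Hypothetical block: instructions are shared, so caching is a cheap clone.
pub struct Block {
    pub instructions: Rc<[u32]>,
    pub constants: Vec<f64>,
}

pub struct Chunk {
    pub blocks: Vec<Block>,
}

pub struct State {
    chunk: Chunk,
    /// Cached instruction list of the block that is currently running.
    current_instructions: Rc<[u32]>,
    pc: usize,
}

impl State {
    pub fn new(chunk: Chunk) -> Self {
        let current_instructions = Rc::clone(&chunk.blocks[0].instructions);
        State { chunk, current_instructions, pc: 0 }
    }

    /// Only block switches pay for the `chunk.blocks` index and bounds check.
    pub fn enter_block(&mut self, i: usize) {
        self.current_instructions = Rc::clone(&self.chunk.blocks[i].instructions);
        self.pc = 0;
    }

    /// The hot path: one slice lookup, no trip through `chunk.blocks`.
    pub fn fetch(&mut self) -> Option<u32> {
        let instruction = self.current_instructions.get(self.pc).copied();
        self.pc += 1;
        instruction
    }
}
```

An `Rc<[u32]>` can be built straight from a `Vec<u32>` with `.into()`, and cloning it only bumps the reference count.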
## Fat LTO and codegen-units = 1
Did absolutely nothing. I couldn't outsmart LLVM.
## Remove RefCell
(upcoming)
I think the `borrow` and `borrow_mut` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in State directly, replacing `Rc <RefCell <Table>>` with my own ref counting. This might remove a layer of indirection, too.
It's a big change, but I'd need _something_ like this for adding a GC anyway, and sometimes big changes have paid off.
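One possible shape for this, purely as a sketch of a plan that isn't implemented yet. `TableId`, `retain`, and `release` are hypothetical names, values are plain `f64`, and freed slots are never reused here. Tables live in a Vec inside `State`, and values refer to them by index, so field access needs `&mut State` but no `RefCell`.

```rust
/// Hypothetical handle: an index into `State::tables`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TableId(usize);

pub struct Table {
    fields: Vec<(i64, f64)>,
    /// Manual reference count, replacing the one inside `Rc`.
    ref_count: usize,
}

#[derive(Default)]
pub struct State {
    tables: Vec<Table>,
}

impl State {
    /// Allocates a table owned by the state, with one reference.
    pub fn new_table(&mut self) -> TableId {
        self.tables.push(Table { fields: Vec::new(), ref_count: 1 });
        TableId(self.tables.len() - 1)
    }

    pub fn get_field(&self, t: TableId, key: i64) -> Option<f64> {
        let fields = &self.tables[t.0].fields;
        fields.iter().find(|e| e.0 == key).map(|e| e.1)
    }

    pub fn set_field(&mut self, t: TableId, key: i64, value: f64) {
        let fields = &mut self.tables[t.0].fields;
        match fields.iter_mut().find(|e| e.0 == key) {
            Some(entry) => entry.1 = value,
            None => fields.push((key, value)),
        }
    }

    pub fn retain(&mut self, t: TableId) {
        self.tables[t.0].ref_count += 1;
    }

    /// Drops one reference. Returns true if the table's contents were freed.
    pub fn release(&mut self, t: TableId) -> bool {
        let table = &mut self.tables[t.0];
        table.ref_count -= 1;
        if table.ref_count == 0 {
            table.fields.clear();
            true
        } else {
            false
        }
    }
}
```

The borrow checker now enforces exclusive access through `&mut State` at compile time, which is what would replace the runtime `borrow_mut` check. The cost is that forgetting a `retain` or `release` is a logic bug instead of a compile error.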
## Iterating over instruction list
(upcoming)
I noticed PUC Lua doesn't store a program counter. It stores a `u32 *`, a pointer to the next instruction itself. This might save a single cycle or so. I can't believe it does anything, but it could, because it skips the "look at the instruction list, multiply the index by 4, add it to the base pointer" step.
Maybe the real saving is that it saves a little bit of cache space by forgetting the base pointer?
Storing an iterator sounds like a big fight with the borrow checker. I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C. Plus a bounds check.
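A prototype outside the interpreter could start out like this. The encoding (opcode in the low 8 bits, operand in the upper 24) and the three opcodes are invented for the sketch. The loop keeps a `std::slice::Iter` instead of an index, and a jump re-slices from the base, which is where the bounds check comes back in.

```rust
// Hypothetical opcodes for the prototype.
pub const OP_ADD_IMM: u32 = 0; // acc += operand
pub const OP_JUMP: u32 = 1; // continue at absolute index `operand`
pub const OP_HALT: u32 = 2;

pub fn encode(op: u32, operand: u32) -> u32 {
    op | operand << 8
}

/// Runs `code` and returns the accumulator.
pub fn run(code: &[u32]) -> u32 {
    // The iterator plays the role of the pointer to the next instruction.
    let mut next = code.iter();
    let mut acc = 0u32;
    while let Some(&instruction) = next.next() {
        let operand = instruction >> 8;
        match instruction & 0xff {
            OP_ADD_IMM => acc += operand,
            // Absolute jumps still need the base slice, and slicing
            // panics if the target is out of range.
            OP_JUMP => next = code[operand as usize..].iter(),
            OP_HALT => break,
            other => panic!("unknown opcode {other}"),
        }
    }
    acc
}
```

Inside one function this compiles without trouble because the iterator just borrows a local slice. The harder part is likely storing the iterator in a struct next to the `Rc` that owns the instructions, since that is a self-referential borrow.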