📝 doc: document attempted and planned optimizations
parent
43a3294c57
commit
32ddedc066
# Optimizations

Making notes on optimizations I've made and plan to make, so I can remember which ones paid off.

## String interning

Worked well. PUC Lua does this. I think it's faster not because it avoids
hashing or comparing strings, but because it avoids the pointer deref.
I still ended up hashing ints after this change.

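For reference, a minimal sketch of what interning can look like. This is not the interpreter's actual code: `Interner`, `InternedString`, `intern` and `resolve` are hypothetical names, and a real implementation would probably avoid storing each string twice.

```rust
use std::collections::HashMap;

/// Hypothetical handle for an interned string. Comparing two handles is a
/// plain integer comparison, with no pointer to follow.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct InternedString(u32);

/// Hypothetical interner: maps each distinct string to a small integer.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns the existing handle for `s`, or allocates a new one.
    pub fn intern(&mut self, s: &str) -> InternedString {
        if let Some(&id) = self.ids.get(s) {
            return InternedString(id);
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        InternedString(id)
    }

    /// Looks the original text back up, e.g. for printing.
    pub fn resolve(&self, handle: InternedString) -> &str {
        &self.strings[handle.0 as usize]
    }
}
```

The string itself is only hashed once, inside `intern`. After that, table lookups keyed by the handle still hash an integer, which matches the note above about hashing ints.
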
## Linear search

The n_body benchmark uses tables with about 7 slots in its hot loop.
The hashing overhead of HashMap for i64 seems pretty bad for this.
BTreeMap was faster, but not fast enough.

I switched to just an unsorted Vec and linear search, and it's the
fastest by a small margin.

I don't think PUC Lua does this, but PUC Lua might have a faster, less
secure hash algorithm than Rust's default.

A flamegraph shows we still spend a lot of time linearly searching tables.

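A rough sketch of the unsorted-Vec idea, under the assumption that keys are `i64`. `SmallTable` and `Value` are hypothetical stand-ins, not the interpreter's real types, and nil handling is left out.

```rust
/// Hypothetical stand-in for the interpreter's value type.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Nil,
    Float(f64),
}

/// A table stored as an unsorted list of key-value pairs.
#[derive(Default)]
pub struct SmallTable {
    slots: Vec<(i64, Value)>,
}

impl SmallTable {
    /// Scans every slot in order; cheap when there are only a handful of keys.
    pub fn get(&self, key: i64) -> Option<&Value> {
        self.slots.iter().find(|(k, _)| *k == key).map(|(_, v)| v)
    }

    /// Overwrites an existing key in place, otherwise appends a new slot.
    pub fn set(&mut self, key: i64, value: Value) {
        match self.slots.iter_mut().find(|(k, _)| *k == key) {
            Some(slot) => slot.1 = value,
            None => self.slots.push((key, value)),
        }
    }
}
```

With about 7 slots, a lookup is at most 7 integer comparisons over contiguous memory, with no hashing at all. The cost grows linearly with table size, so this only makes sense for small tables.
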
## Lazy instruction decoding

I think this actually slowed it down. PUC Lua keeps instructions in their
encoded u32 form and decodes them lazily inside the interpreter's main loop.

I did this mostly to match PUC Lua, although I didn't think it would work. My enum for decoded instructions is only 64 bits, and I didn't think the extra bit fiddling would be cheap enough to pay for itself.

Maybe if I tweaked it, it would pay off. It just really doesn't look like it should work.

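To make "bit fiddling" concrete, here is a sketch of lazy field extraction. It assumes the Lua 5.4 `iABC` layout (7-bit opcode, 8-bit A, 1-bit k, 8-bit B, 8-bit C, from the low bits up). `Encoded` and its methods are hypothetical names, not the interpreter's API.

```rust
/// One encoded instruction, kept as the raw `u32`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Encoded(pub u32);

impl Encoded {
    /// Packs fields into the assumed iABC layout, leaving the k bit at 0.
    pub fn new_abc(opcode: u8, a: u8, b: u8, c: u8) -> Self {
        Encoded(
            (opcode as u32 & 0x7f)
                | ((a as u32) << 7)
                | ((b as u32) << 16)
                | ((c as u32) << 24),
        )
    }

    /// Bits 0..=6.
    pub fn opcode(self) -> u8 {
        (self.0 & 0x7f) as u8
    }

    /// Bits 7..=14.
    pub fn a(self) -> u8 {
        ((self.0 >> 7) & 0xff) as u8
    }

    /// Bit 15.
    pub fn k(self) -> bool {
        ((self.0 >> 15) & 1) == 1
    }

    /// Bits 16..=23.
    pub fn b(self) -> u8 {
        ((self.0 >> 16) & 0xff) as u8
    }

    /// Bits 24..=31.
    pub fn c(self) -> u8 {
        (self.0 >> 24) as u8
    }
}
```

Each accessor is a shift and a mask, done every time the main loop needs a field. The trade is that work against an instruction stream half the size of a 64-bit decoded enum.
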
## Caching the current block

I think this one paid off. The idea was to avoid some `chunk.blocks [i]` derefs and bounds checks in the inner loop.

I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block.

## Caching the current instruction list

I think this one paid off more. Instead of caching the current block, I just cached its instructions, since the inner loop doesn't use constants or upvalues much, but every step requires access to the instruction list.

Using `Rc <[u32]>` was fun, too. I had never stored a slice directly in a smart pointer before.

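A sketch of how the cached instruction list might be wired up. `Block`, `Chunk`, `Frame`, `enter` and `fetch` are hypothetical names for illustration; only the `Rc<[u32]>` type comes from the notes above.

```rust
use std::rc::Rc;

/// Hypothetical block type; the real one also holds constants and upvalues.
pub struct Block {
    pub instructions: Rc<[u32]>,
}

/// Hypothetical chunk: a list of blocks, indexed as `chunk.blocks[i]`.
pub struct Chunk {
    pub blocks: Vec<Block>,
}

/// Hypothetical per-call state for the interpreter loop.
pub struct Frame {
    pub block_idx: usize,
    pub pc: usize,
    /// Cached so the hot loop doesn't go through `chunk.blocks[i]` each step.
    pub instructions: Rc<[u32]>,
}

impl Frame {
    /// Sets up a frame for the given block, caching its instruction list.
    pub fn enter(chunk: &Chunk, block_idx: usize) -> Self {
        Frame {
            block_idx,
            pc: 0,
            // Cloning an `Rc` only bumps a reference count; the slice is shared.
            instructions: Rc::clone(&chunk.blocks[block_idx].instructions),
        }
    }

    /// Returns the next encoded instruction, if any, and advances `pc`.
    pub fn fetch(&mut self) -> Option<u32> {
        let inst = self.instructions.get(self.pc).copied()?;
        self.pc += 1;
        Some(inst)
    }
}
```

An `Rc<[u32]>` can be built from a `Vec<u32>` with `Rc::from(vec)`. Because the frame owns its own `Rc`, it doesn't hold a borrow on the chunk while the loop runs.
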
## Fat LTO and codegen-units = 1

Did absolutely nothing. I couldn't outsmart LLVM.

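For the record, these are the standard Cargo profile keys for this experiment. Putting them under `[profile.release]` is an assumption; the notes don't say which profile was used.

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```
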
## Remove RefCell

(upcoming)

I think the `borrow` and `borrow_mut` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in State directly, replacing `Rc <RefCell <Table>>` with my own ref counting. This might remove a layer of indirection, too.

It's a big change, but I'd need _something_ like this for adding a GC anyway, and sometimes big changes have paid off.

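One possible shape for this, as a sketch only. `TableId`, `Slot`, `inc_ref` and `dec_ref` are hypothetical names, `Table` is simplified to `i64` keys and `f64` values, and freed slots are not reused.

```rust
use std::collections::HashMap;

/// Hypothetical table type; keys and values are simplified for the sketch.
#[derive(Default)]
pub struct Table {
    fields: HashMap<i64, f64>,
}

/// Index into `State::tables`, standing in for `Rc<RefCell<Table>>`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TableId(usize);

struct Slot {
    table: Table,
    ref_count: u32,
}

/// Hypothetical interpreter state that owns every table directly.
#[derive(Default)]
pub struct State {
    tables: Vec<Option<Slot>>,
}

impl State {
    /// Allocates a table with a reference count of 1.
    pub fn new_table(&mut self) -> TableId {
        self.tables.push(Some(Slot { table: Table::default(), ref_count: 1 }));
        TableId(self.tables.len() - 1)
    }

    pub fn inc_ref(&mut self, id: TableId) {
        if let Some(slot) = self.tables[id.0].as_mut() {
            slot.ref_count += 1;
        }
    }

    /// Drops one reference and frees the table when the count reaches zero.
    pub fn dec_ref(&mut self, id: TableId) {
        let free = match self.tables[id.0].as_mut() {
            Some(slot) => {
                slot.ref_count -= 1;
                slot.ref_count == 0
            }
            None => false,
        };
        if free {
            self.tables[id.0] = None;
        }
    }

    /// No `borrow()`: `&self` already proves nobody is mutating.
    pub fn get_field(&self, id: TableId, key: i64) -> Option<f64> {
        self.tables[id.0].as_ref()?.table.fields.get(&key).copied()
    }

    /// No `borrow_mut()`: `&mut self` already proves exclusive access.
    pub fn set_field(&mut self, id: TableId, key: i64, value: f64) {
        if let Some(slot) = self.tables[id.0].as_mut() {
            slot.table.fields.insert(key, value);
        }
    }
}
```

The runtime borrow flag goes away because the borrow checker sees all table access through `State`. The cost is that reference counts are now managed by hand, so a missed `dec_ref` leaks and an extra one frees a table that is still in use.
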
## Iterating over instruction list

(upcoming)

I noticed PUC Lua doesn't store a program counter as an index. It stores a `u32 *`, a pointer to the next instruction itself. This might save a single cycle or so. I can't believe it does much, but it could, because it skips the "look at the instruction list, multiply the index by 4, add it to the base pointer" step.

Maybe the real saving is a little bit of cache space from forgetting the base pointer?

Storing an iterator sounds like a big fight with the borrow checker. I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C, plus a bounds check.

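A possible starting point for that prototype, outside the interpreter. The toy instruction set here (opcode in the low 8 bits, operand in the upper 24) is made up for the sketch and is not Lua's encoding. Only forward jumps are handled, since a backward jump would need the base slice again.

```rust
// Hypothetical toy opcodes, not Lua's.
pub const OP_ADD: u32 = 0; // add operand to the accumulator
pub const OP_SKIP: u32 = 1; // skip the next `operand` instructions
pub const OP_HALT: u32 = 2; // stop

/// Packs a toy instruction: opcode in bits 0..=7, operand above it.
pub fn encode(op: u32, operand: u32) -> u32 {
    op | (operand << 8)
}

/// Index-based loop: every fetch computes `base + pc * 4` and checks bounds.
pub fn run_indexed(code: &[u32]) -> u32 {
    let mut acc = 0;
    let mut pc = 0;
    while let Some(&inst) = code.get(pc) {
        pc += 1;
        let operand = inst >> 8;
        match inst & 0xff {
            OP_ADD => acc += operand,
            OP_SKIP => pc += operand as usize,
            _ => break, // OP_HALT or anything unknown
        }
    }
    acc
}

/// Iterator-based loop: `slice::Iter` tracks the next element directly.
pub fn run_iter(code: &[u32]) -> u32 {
    let mut acc = 0;
    let mut iter = code.iter();
    while let Some(&inst) = iter.next() {
        let operand = inst >> 8;
        match inst & 0xff {
            OP_ADD => acc += operand,
            OP_SKIP => {
                // `nth(n)` consumes n + 1 items, so this skips `operand` of them.
                if operand > 0 {
                    iter.nth(operand as usize - 1);
                }
            }
            _ => break, // OP_HALT or anything unknown
        }
    }
    acc
}
```

Both loops should give the same result for any forward-only program. Whether the iterator version is actually faster would need a benchmark, and storing the iterator in a long-lived struct next to the `Rc` it borrows from is where the borrow checker fight would probably start.
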