70 lines
3.1 KiB
Markdown
70 lines
3.1 KiB
Markdown
# Optimizations
|
|
|
|
Making notes on optimizations I've made and plan to make, so I can remember which ones paid off.
|
|
|
|
## String interning
|
|
|
|
Worked well. PUC Lua does this. I think it's faster not because it avoids
|
|
hashing or comparing strings, but because it avoids the pointer deref.
|
|
I still ended up hashing ints after this change.
|
|
|
|
## Linear search
|
|
|
|
The n_body benchmark uses tables with about 7 slots in its hot loop.
|
|
The hashing overhead of HashMap for i64 seems pretty bad for this.
|
|
BTreeMap was faster, but not fast enough.
|
|
|
|
I switched to just an unsorted Vec and linear search, and it's the
|
|
fastest by a small margin.
|
|
|
|
I don't think PUC Lua does this, but PUC Lua might have a faster, less
|
|
secure hash algorithm than Rust's default.
|
|
|
|
Flamegraph reveals we still spend a lot of time in linear searching tables.
|
|
|
|
## Lazy instruction decoding
|
|
|
|
I think this actually slowed it down. PUC Lua keeps instructions in their
|
|
encoded u32 form and decodes them lazily inside the interpreter's main loop.
|
|
|
|
I did this mostly to match PUC Lua, although I didn't think it would work. My enum for decoded instructions is only 64 bits, and I didn't think the extra bit fiddling was cheap enough.
|
|
|
|
Maybe if I tweaked it, it would pay off. It just really doesn't look like it should work.
|
|
|
|
## Caching the current block
|
|
|
|
I think this one paid off. The idea was to avoid some `chunk.blocks [i]` derefs and bound checks in the inner loop.
|
|
|
|
I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block.
|
|
|
|
## Caching the current instruction list
|
|
|
|
I think this one paid off more. Instead of caching the current block I just cached its instructions, since the inner loop doesn't use constants or upvalues much, but every step requires access to the instruction list.
|
|
|
|
Using `Rc <[u32]>` was fun, too. I never stored a slice directly in a smart pointer before.
|
|
|
|
## Fat LTO and codegen-units = 1
|
|
|
|
Did absolutely nothing. I couldn't outsmart LLVM.
|
|
|
|
## Remove RefCell
|
|
|
|
Result: Worked well. Dropped from about 3,700 to 3,200 samples. The code got even uglier.
|
|
|
|
Plan:
|
|
|
|
I think the `borrow` and `borrow_mut` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in State directly, replacing `Rc <RefCell <Table>>` with my own ref counting. This might
|
|
remove a layer of indirection, too.
|
|
|
|
It's a big change, but I'd need _something_ like this for adding a GC anyway, and sometimes big changes have paid off.
|
|
|
|
## Iterating over instruction list
|
|
|
|
(upcoming)
|
|
|
|
I noticed PUC Lua doesn't store a program counter, it stores a `u32 *`, a pointer to the next instruction itself. This might save, like, 1 single cycle or something, I can't believe it does anything, but it could. Because it saves you that "Look at the instruction list, multiply the index by 4, add it to the base pointer" step.
|
|
|
|
Maybe the real saving is that it saves a little bit of cache space by forgetting the base pointer?
|
|
|
|
Storing an iterator sounds like a big fight with the borrow checker. I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C. Plus a bounds check.
|