📝 doc: document attempted and planned optimizations
parent 43a3294c57
commit 32ddedc066
| @@ -0,0 +1,67 @@ | ||||
| # Optimizations | ||||
| 
 | ||||
| Making notes on optimizations I've made and plan to make, so I can remember which ones paid off. | ||||
| 
 | ||||
| ## String interning | ||||
| 
 | ||||
| Worked well. PUC Lua does this. I think it's faster not because it avoids | ||||
| hashing or comparing strings, but because it avoids the pointer deref. | ||||
| I still ended up hashing ints after this change. | ||||
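
For reference, a minimal sketch of what interning means here. The names `Interner` and `Sym` are made up for illustration and are not the interpreter's actual types. Each distinct string gets a small integer id once, and after that the hot path only compares and hashes ids instead of following a pointer to string bytes.

```rust
use std::collections::HashMap;

/// Hypothetical handle for an interned string: just an index.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Sym(u32);

/// Hypothetical interner, not the interpreter's real type.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, Sym>,
    names: Vec<String>,
}

impl Interner {
    /// Returns the existing id for `s`, or allocates a new one.
    pub fn intern(&mut self, s: &str) -> Sym {
        if let Some(&sym) = self.ids.get(s) {
            return sym;
        }
        let sym = Sym(self.names.len() as u32);
        self.names.push(s.to_owned());
        self.ids.insert(s.to_owned(), sym);
        sym
    }

    /// Looks the text back up, e.g. for printing or error messages.
    pub fn resolve(&self, sym: Sym) -> &str {
        &self.names[sym.0 as usize]
    }
}
```

The string itself is only hashed inside `intern`. Table lookups after that work on `Sym` values, which is why ints still get hashed.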
| 
 | ||||
| ## Linear search | ||||
| 
 | ||||
| The n_body benchmark uses tables with about 7 slots in its hot loop. | ||||
| The hashing overhead of HashMap for i64 seems pretty bad for this. | ||||
| BTreeMap was faster, but not fast enough. | ||||
| 
 | ||||
| I switched to just an unsorted Vec and linear search, and it's the | ||||
| fastest by a small margin. | ||||
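
Roughly what the unsorted-Vec approach looks like, as a sketch. `SmallTable` is a hypothetical name, and the `i64` keys stand in for interned string ids. With only a handful of slots, scanning a contiguous Vec can beat hashing, though that depends on the workload.

```rust
/// Hypothetical small table: an unsorted Vec of (key, value) pairs.
pub struct SmallTable<V> {
    entries: Vec<(i64, V)>,
}

impl<V> SmallTable<V> {
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Linear scan: O(n), but n is tiny and the data is contiguous.
    pub fn get(&self, key: i64) -> Option<&V> {
        self.entries.iter().find(|(k, _)| *k == key).map(|(_, v)| v)
    }

    /// Overwrites an existing key or appends a new pair.
    /// Returns the previous value, if there was one.
    pub fn insert(&mut self, key: i64, value: V) -> Option<V> {
        for (k, v) in self.entries.iter_mut() {
            if *k == key {
                return Some(std::mem::replace(v, value));
            }
        }
        self.entries.push((key, value));
        None
    }
}
```

This obviously degrades for big tables, so a real implementation would probably want to fall back to a map past some size.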
| 
 | ||||
| I don't think PUC Lua does this, but PUC Lua might have a faster, less | ||||
| secure hash algorithm than Rust's default. | ||||
| 
 | ||||
| The flamegraph shows we still spend a lot of time linearly searching tables. | ||||
| 
 | ||||
| ## Lazy instruction decoding | ||||
| 
 | ||||
| I think this actually slowed it down. PUC Lua keeps instructions in their | ||||
| encoded u32 form and decodes them lazily inside the interpreter's main loop. | ||||
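
A sketch of what "decode lazily" means. This assumes a field layout modeled on Lua 5.4's iABC format (opcode in the low 7 bits, A in the next 8, one `k` bit, then B and C at 8 bits each). The interpreter's real layout may differ, and `Inst` and `encode_abc` are hypothetical names.

```rust
/// An instruction kept in its encoded 32-bit form.
#[derive(Clone, Copy)]
pub struct Inst(pub u32);

impl Inst {
    /// Bits 0..=6.
    pub fn opcode(self) -> u8 {
        (self.0 & 0x7f) as u8
    }
    /// Bits 7..=14 (the cast to u8 keeps only the low 8 bits).
    pub fn a(self) -> u8 {
        (self.0 >> 7) as u8
    }
    /// Bits 16..=23. Bit 15 is the `k` flag, unused in this sketch.
    pub fn b(self) -> u8 {
        (self.0 >> 16) as u8
    }
    /// Bits 24..=31.
    pub fn c(self) -> u8 {
        (self.0 >> 24) as u8
    }
}

/// Packs fields back into a word, only here so the sketch can be tested.
pub fn encode_abc(op: u8, a: u8, b: u8, c: u8) -> Inst {
    Inst(((op as u32) & 0x7f) | ((a as u32) << 7) | ((b as u32) << 16) | ((c as u32) << 24))
}
```

The interpreter loop would call `opcode()` first and only pull out the fields that particular opcode needs, so the shifts and masks are paid per step instead of once at load time.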
| 
 | ||||
| I did this mostly to match PUC Lua, although I didn't think it would work. My enum for decoded instructions is only 64 bits, and I didn't think the extra bit fiddling would be cheap enough to pay for shrinking each instruction to 32 bits. | ||||
| 
 | ||||
| Maybe if I tweaked it, it would pay off. It just really doesn't look like it should work. | ||||
| 
 | ||||
| ## Caching the current block | ||||
| 
 | ||||
| I think this one paid off. The idea was to avoid some `chunk.blocks [i]` derefs and bounds checks in the inner loop. | ||||
| 
 | ||||
| I used an `Rc` to make it work. PUC Lua probably just keeps a raw pointer to the block. | ||||
| 
 | ||||
| ## Caching the current instruction list | ||||
| 
 | ||||
| I think this one paid off more. Instead of caching the current block I just cached its instructions, since the inner loop doesn't use constants or upvalues much, but every step requires access to the instruction list. | ||||
| 
 | ||||
| Using `Rc <[u32]>` was fun, too. I'd never stored a slice directly in a smart pointer before. | ||||
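
A sketch of the caching idea, with made-up `Chunk` and `Block` types and a toy loop that just sums the instruction words. The point is that the `Rc` is cloned once when entering a block, so each step reads from `code` instead of going through `chunk.blocks [i]` again.

```rust
use std::rc::Rc;

/// Hypothetical block layout, not the interpreter's real one.
pub struct Block {
    pub instructions: Rc<[u32]>,
    pub constants: Vec<f64>,
}

pub struct Chunk {
    pub blocks: Vec<Block>,
}

/// Toy stand-in for the interpreter loop: sums all instruction words.
pub fn sum_words(chunk: &Chunk, block_idx: usize) -> u64 {
    // One index and bounds check here, then a refcount bump. No copy of the slice.
    let code: Rc<[u32]> = Rc::clone(&chunk.blocks[block_idx].instructions);
    let mut pc = 0usize;
    let mut total = 0u64;
    // Every step only touches the cached instruction list.
    while let Some(&word) = code.get(pc) {
        total += u64::from(word);
        pc += 1;
    }
    total
}
```

`Rc::from (vec)` converts a `Vec <u32>` into an `Rc <[u32]>`, which is one way to build the instruction list at load time.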
| 
 | ||||
| ## Fat LTO and codegen-units = 1 | ||||
| 
 | ||||
| Did absolutely nothing. I couldn't outsmart LLVM. | ||||
| 
 | ||||
| ## Remove RefCell | ||||
| 
 | ||||
| (upcoming) | ||||
| 
 | ||||
| I think the `borrow` and `borrow_mut` calls slow down OP_GETFIELD and OP_SETFIELD. I can remove them if I store all the tables in State directly, replacing `Rc <RefCell <Table>>` with my own ref counting. This might | ||||
| remove a layer of indirection, too. | ||||
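
One possible shape for this, purely as a sketch. `State`, `TableId`, `retain`, and `release` are hypothetical names, and the table contents are a placeholder. Tables live in a Vec owned by the state, values hold a plain index, and the ref count is bumped by hand, so there is no `RefCell` borrow flag to check on each access.

```rust
/// Hypothetical handle: an index into State's table storage.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TableId(usize);

/// Placeholder table contents.
#[derive(Default)]
pub struct Table {
    pub fields: Vec<(i64, f64)>,
}

struct Slot {
    refs: u32,
    table: Table,
}

#[derive(Default)]
pub struct State {
    slots: Vec<Option<Slot>>,
    free: Vec<usize>, // indices of freed slots, reused by new_table
}

impl State {
    /// Allocates a table with a ref count of 1.
    pub fn new_table(&mut self) -> TableId {
        let slot = Some(Slot { refs: 1, table: Table::default() });
        match self.free.pop() {
            Some(i) => {
                self.slots[i] = slot;
                TableId(i)
            }
            None => {
                self.slots.push(slot);
                TableId(self.slots.len() - 1)
            }
        }
    }

    /// Plain `&mut` access, checked by the borrow checker at compile time.
    pub fn table_mut(&mut self, id: TableId) -> &mut Table {
        &mut self.slots[id.0].as_mut().expect("dangling TableId").table
    }

    pub fn retain(&mut self, id: TableId) {
        self.slots[id.0].as_mut().expect("dangling TableId").refs += 1;
    }

    /// Drops one reference and frees the slot when the count hits zero.
    pub fn release(&mut self, id: TableId) {
        let slot = self.slots[id.0].as_mut().expect("dangling TableId");
        slot.refs -= 1;
        if slot.refs == 0 {
            self.slots[id.0] = None;
            self.free.push(id.0);
        }
    }

    pub fn live_tables(&self) -> usize {
        self.slots.iter().filter(|s| s.is_some()).count()
    }
}
```

The catch is that forgetting a `retain` or `release` is now a logic bug instead of something the compiler catches, and the index lookup still has a bounds check. Cycles leak here just like they do with `Rc`.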
| 
 | ||||
| It's a big change, but I'd need _something_ like this for adding a GC anyway, and sometimes big changes have paid off. | ||||
| 
 | ||||
| ## Iterating over instruction list | ||||
| 
 | ||||
| (upcoming) | ||||
| 
 | ||||
| I noticed PUC Lua doesn't store a program counter. It stores a `u32 *`, a pointer to the next instruction itself. This might save, like, one single cycle. I can't believe it does anything, but it could, because it skips the "look at the instruction list, multiply the index by 4, add it to the base pointer" step. | ||||
| 
 | ||||
| Maybe the real saving is that it saves a little bit of cache space by forgetting the base pointer? | ||||
| 
 | ||||
| Storing an iterator sounds like a big fight with the borrow checker. I might want to prototype it outside the interpreter first. But if it works, it might compile down to what PUC Lua does in C. Plus a bounds check. | ||||
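
A prototype-sized sketch of the two fetch styles, outside the interpreter. Both functions are toys that just sum the instruction words, and the names are made up. `std::slice::Iter` holds a pointer to the next element plus an end pointer, so `next ()` is a compare and a pointer bump.

```rust
/// Index-based fetch: each step does base + pc * 4, plus a bounds check.
pub fn run_indexed(code: &[u32]) -> u64 {
    let mut pc = 0usize;
    let mut acc = 0u64;
    while pc < code.len() {
        acc += u64::from(code[pc]);
        pc += 1;
    }
    acc
}

/// Iterator-based fetch: the iterator is the "pointer to the next instruction".
pub fn run_iter(code: &[u32]) -> u64 {
    let mut next = code.iter();
    let mut acc = 0u64;
    // `while let` instead of `for`, because a real interpreter loop would
    // need to replace or advance `next` inside the body to do jumps.
    while let Some(&word) = next.next() {
        acc += u64::from(word);
    }
    acc
}
```

Forward jumps could use `Iterator::nth`, but a slice iterator can't move its start backward, so a backward jump means re-slicing from the base, something like `next = code [target..].iter ()`. That keeps the base pointer around, which works against the cache-space idea above.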