📝 docs: document wget spidering

main
_ 2020-12-21 22:16:12 -06:00
parent 7645831a09
commit b62c1424fa
2 changed files with 23 additions and 1 deletion

.gitignore

@@ -3,5 +3,5 @@
/ptth_server.toml
/ptth_relay.toml
/ptth_build_L6KLMVS6/
/scraper-secret.txt
/target


@@ -61,6 +61,28 @@ e.g. `0..3` means "0, 1, 2, 3". So `100-199` means 199 is the last byte retrieved.
By polling with HEAD and byte range requests, a scraper client can approximate
the behavior of `tail -f` on a server-side file.
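A minimal sketch of that polling loop, using curl for illustration. The `$API`
base URL, server name, and auth header file follow the wget example below; the
log file path and the poll interval are made-up placeholders, not something
this commit specifies.

```bash
# Poll a server-side file and print only newly appended bytes,
# approximating `tail -f`. Paths and interval are illustrative.
URL="$API/v1/server/aliens_wildland/files/log.txt"
OFFSET=0
while true; do
    # HEAD request: learn the current size without downloading the body
    LEN=$(curl -sI -H "$(<scraper-secret.txt)" "$URL" \
          | tr -d '\r' | awk 'tolower($1) == "content-length:" { print $2 }')
    if [ -n "$LEN" ] && [ "$LEN" -gt "$OFFSET" ]; then
        # Byte-range request: fetch only the bytes we have not seen yet
        curl -s -H "$(<scraper-secret.txt)" \
             -H "Range: bytes=$OFFSET-$((LEN - 1))" "$URL"
        OFFSET=$LEN
    fi
    sleep 5
done
```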
`wget --continue --execute robots=off --no-parent --recursive --header "$(<scraper-secret.txt)" $API/v1/server/aliens_wildland/files/crates/`
Use wget's recursive spidering to download all the files in a folder.
The human-friendly HTML interface is exposed through the scraper
API, so this will also download the HTML directory listings.
- `--continue` uses the server's `Content-Length` header to skip over
files that are already fully downloaded to local disk. Partial
downloads will be resumed where they left off, which is fine
for long-running log files that may append new data but not
modify old data.
- `--execute robots=off` disables wget's handling of robots.txt.
We know we're a robot, the server doesn't care, it's fine.
- `--no-parent` prevents the `../` links from accidentally causing
infinite recursion.
- `--recursive` causes wget to follow links to individual files and
to recurse into subdirectories.
- `--header "$(<scraper-secret.txt)"` tells Bash to load the
secret API key from disk and send it to wget. The secret will
leak into the process list, but at least it won't leak into
your `.bash_history` file. (See the sketch after this list for
the file's expected format.)
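
For reference, `scraper-secret.txt` is expected to hold one complete HTTP
header line, which wget passes through verbatim. A minimal sketch of creating
it; the header name `X-ApiKey` and the key value are hypothetical placeholders,
not something this commit defines.

```bash
# scraper-secret.txt holds a single header line for wget's --header flag.
# The header name and value below are hypothetical placeholders.
echo 'X-ApiKey: REPLACE-WITH-YOUR-KEY' > scraper-secret.txt
chmod 600 scraper-secret.txt   # keep the secret unreadable to other users
```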
## Problem statement
PTTH has 2 auth routes: