📝 docs: document wget spidering

main
_ 2020-12-21 22:16:12 -06:00
parent 7645831a09
commit b62c1424fa
2 changed files with 23 additions and 1 deletion

.gitignore

@@ -3,5 +3,5 @@
/ptth_server.toml
/ptth_relay.toml
/ptth_build_L6KLMVS6/
/scraper-secret.txt
/target


@@ -61,6 +61,28 @@ e.g. `0..3` means "0, 1, 2, 3". So `100-199` means 199 is the last byte retrieved.
By polling with HEAD and byte range requests, a scraper client can approximate
the behavior of `tail -f` on a server-side file.
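A minimal sketch of that polling loop, using curl for illustration. The `$API`
base URL, server name, and auth header file follow the wget example below; the
log file path and the poll interval are made-up placeholders, not something
this commit specifies.

```bash
# Poll a server-side file and print only newly appended bytes,
# approximating `tail -f`. Paths and interval are illustrative.
URL="$API/v1/server/aliens_wildland/files/log.txt"
OFFSET=0
while true; do
    # HEAD request: learn the current size without downloading the body
    LEN=$(curl -sI -H "$(<scraper-secret.txt)" "$URL" \
          | tr -d '\r' | awk 'tolower($1) == "content-length:" { print $2 }')
    if [ -n "$LEN" ] && [ "$LEN" -gt "$OFFSET" ]; then
        # Byte-range request: fetch only the bytes we have not seen yet
        curl -s -H "$(<scraper-secret.txt)" \
             -H "Range: bytes=$OFFSET-$((LEN - 1))" "$URL"
        OFFSET=$LEN
    fi
    sleep 5
done
```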
`wget --continue --execute robots=off --no-parent --recursive --header "$(<scraper-secret.txt)" $API/v1/server/aliens_wildland/files/crates/`
Use wget's recursive spidering to download all the files in a folder.
The human-friendly HTML interface is exposed through the scraper
API, so this will also download the HTML directory listings.
- `--continue` uses the server's `Content-Length` header to skip over
files that are already fully downloaded to local disk. Partial
downloads will be resumed where they left off, which is fine
for long-running log files that may append new data but not
modify old data.
- `--execute robots=off` disables wget's handling of robots.txt.
We know we're a robot, the server doesn't care, it's fine.
- `--no-parent` prevents the `../` links from accidentally causing
infinite recursion.
- `--recursive` causes wget to follow links to individual files and
to recurse into subdirectories.
- `--header "$(<scraper-secret.txt)"` tells Bash to load the
secret API key from disk and send it to wget. The secret will
leak into the process list, but at least it won't leak into
your `.bash_history` file. (See the sketch after this list for
the file's expected format.)
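
For reference, `scraper-secret.txt` is expected to hold one complete HTTP
header line, which wget passes through verbatim. A minimal sketch of creating
it; the header name `X-ApiKey` and the key value are hypothetical placeholders,
not something this commit defines.

```bash
# scraper-secret.txt holds a single header line for wget's --header flag.
# The header name and value below are hypothetical placeholders.
echo 'X-ApiKey: REPLACE-WITH-YOUR-KEY' > scraper-secret.txt
chmod 600 scraper-secret.txt   # keep the secret unreadable to other users
```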
## Problem statement
PTTH has 2 auth routes: