📝 docs: document wget spidering
parent 7645831a09
commit b62c1424fa
@@ -3,5 +3,5 @@
/ptth_server.toml
/ptth_relay.toml
/ptth_build_L6KLMVS6/
/scraper-secret.txt
/target
@@ -61,6 +61,28 @@ e.g. `0..3` means "0, 1, 2, 3". So 100-199 means 199 is the last byte retrieved.

By polling with HEAD and byte range requests, a scraper client can approximate
`tail -f` behavior of a server-side file.
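A minimal sketch of such a polling loop, using curl and the same
`--header "$(<scraper-secret.txt)"` auth as the wget example below. The
`crates/log.txt` path and the 5-second interval are made up for
illustration; `$API` is assumed to be set to the base URL of the scraper API.

```bash
#!/usr/bin/env bash
# Sketch only: approximate `tail -f` for one server-side file by polling.
# Assumes $API is exported and scraper-secret.txt holds the auth header;
# the crates/log.txt path is a made-up example.
set -euo pipefail

url="$API/v1/server/aliens_wildland/files/crates/log.txt"
offset=0

while true; do
    # HEAD request to learn the file's current length.
    length=$(curl --silent --head --header "$(<scraper-secret.txt)" "$url" \
        | tr -d '\r' | awk 'tolower($1) == "content-length:" { print $2 }')

    if [ "${length:-0}" -gt "$offset" ]; then
        # Fetch only the new bytes. Ranges are inclusive: bytes=100-199
        # means byte 199 is the last byte retrieved.
        curl --silent --header "$(<scraper-secret.txt)" \
            --range "$offset-$((length - 1))" "$url"
        offset="$length"
    fi
    sleep 5
done
```

Each pass transfers only the new bytes, so the loop stays cheap even for
large, append-only log files.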
`wget --continue --execute robots=off --no-parent --recursive --header "$(<scraper-secret.txt)" $API/v1/server/aliens_wildland/files/crates/`

Use wget's recursive spidering to download all the files in a folder.
The human-friendly HTML interface is exposed through the scraper
API, so this will also download the HTML directory listings.
- `--continue` uses the server's content-length header to skip over
  files that are already fully downloaded to local disk. Partial
  downloads will be resumed where they left off, which is fine
  for long-running log files that may append new data but not
  modify old data. (See the sketch after this list.)
- `--execute robots=off` disables wget's handling of robots.txt.
  We know we're a robot, the server doesn't care, it's fine.
- `--no-parent` prevents the `../` links from accidentally causing
  infinite recursion.
- `--recursive` causes wget to recurse into individual files, and
  into subdirectories.
- `--header "$(<scraper-secret.txt)"` tells Bash to load the
  secret API key from disk and send it to wget. The secret will
  leak into the process list, but at least it won't leak into
  your bash_history file.
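Because `--continue` skips files that are already complete, the whole
command is cheap to re-run. A minimal sketch of re-running it on a
schedule, assuming `$API` and `scraper-secret.txt` are set up as above
(the 5-minute interval is arbitrary):

```bash
#!/usr/bin/env bash
# Sketch only: re-run the mirror periodically. Completed files are skipped
# via --continue; partially downloaded log files are resumed.
set -euo pipefail

while true; do
    wget --continue --execute robots=off --no-parent --recursive \
        --header "$(<scraper-secret.txt)" \
        "$API/v1/server/aliens_wildland/files/crates/" \
        || true  # keep looping even if one pass fails
    sleep 300
done
```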
## Problem statement

PTTH has 2 auth routes: