# Auth route for scrapers

(Find this issue with `git grep YNQAQKJS`)

## Test curl commands

Export the scraper API's URL prefix to an environment variable:

`export API=http://127.0.0.1:4000/scraper`

Put your API key into a header file, like this:

```
X-ApiKey: bad_password
```

Call it "scraper-secret.txt" or something else obviously secret. Don't check it into Git. The key will expire every 30 days and need to be rotated manually (for now).

Newer versions of curl can load headers from a text file. All of the commands below use this feature to load the API key.

`curl --header @scraper-secret.txt $API/api/test`

Should return "You're valid!"

`curl --header @scraper-secret.txt $API/v1/server_list`

Should return a JSON object listing all the servers.

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/api/v1/dir/`

Proxies into the "aliens_wildland" server and retrieves a JSON object listing the file server root. (The server must be running a new version of ptth_server which can serve the JSON API.)

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/api/v1/dir/src/`

Same, but retrieves the listing for "/src".

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/files/src/tests.rs`

There is no special API for retrieving files yet, but the existing server API is proxied through the new scraper API on the relay.

`curl --head --header @scraper-secret.txt $API/v1/server/aliens_wildland/files/src/tests.rs`

PTTH supports HEAD requests. This request will yield a "204 No Content", with the "content-length" header.

`curl --header @scraper-secret.txt -H "range: bytes=100-199" $API/v1/server/aliens_wildland/files/src/tests.rs`

PTTH supports byte range requests. This request will skip 100 bytes into the file and read 100 bytes.

To avoid fence-post errors, most programming languages use half-open ranges, e.g. `0..3` means "0, 1, 2". However, HTTP byte ranges are closed ranges, e.g. `0-3` means "0, 1, 2, 3". So `100-199` means 199 is the last byte retrieved.

By polling with HEAD and byte range requests, a scraper client can approximate `tail -f` behavior of a server-side file.
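The sketch below shows one way a client could implement that polling loop. It is not part of ptth; the `reqwest` blocking client, the `anyhow` crate, and the 2-second poll interval are assumptions, while the URL and the `scraper-secret.txt` header file are the ones from the curl examples above.

```
// Approximate `tail -f` on a server-side file by polling with HEAD and Range.
// Sketch only. Assumed crates: reqwest (blocking feature) and anyhow.

use std::{fs, thread::sleep, time::Duration};

fn main() -> anyhow::Result<()> {
    let url = "http://127.0.0.1:4000/scraper/v1/server/aliens_wildland/files/src/tests.rs";

    // scraper-secret.txt holds one line like "X-ApiKey: bad_password"
    let secret = fs::read_to_string("scraper-secret.txt")?;
    let (header_name, key) = secret.trim().split_once(": ").expect("malformed header file");

    let client = reqwest::blocking::Client::new();
    let mut offset: u64 = 0;

    loop {
        // HEAD reports the current length without transferring the body
        let head = client.head(url).header(header_name, key).send()?;
        let len: u64 = head.headers()
            .get(reqwest::header::CONTENT_LENGTH)
            .and_then(|v| v.to_str().ok())
            .and_then(|s| s.parse().ok())
            .unwrap_or(0);

        if len > offset {
            // HTTP ranges are closed, so the last byte index is len - 1
            let range = format!("bytes={}-{}", offset, len - 1);
            let new_bytes = client.get(url)
                .header(header_name, key)
                .header(reqwest::header::RANGE, range)
                .send()?
                .bytes()?;
            print!("{}", String::from_utf8_lossy(&new_bytes));
            offset = len;
        }

        sleep(Duration::from_secs(2));
    }
}
```

A really dumb Bash script (use case 1 below) can do the same loop by alternating the `curl --head` and `range:` commands shown above.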
`wget --continue --execute robots=off --no-parent --recursive --header "$(cat scraper-secret.txt)" $API/v1/server/aliens_wildland/files/src/`

wget can't load headers from a file the way curl does, so the command substitution inlines the API key header. With `--recursive` and `--no-parent`, this mirrors everything under the proxied directory.

## SQL

Check key

```
select name
from scraper_keys
where hash = $1
and strftime ('%s') >= not_before and strftime ('%s') < not_after ;
```

Create key

```
-- Generate entropy in app code

insert into scraper_keys (
	hash, not_before, not_after, name, email
) values (
	$1, strftime ('%s'), strftime ('%s') + 2592000, $4, $5
);

-- Respond to client with plaintext key and then forget it.
-- If a network blip causes the key to evaporate, the client should revoke it.
```

Revoke key

```
```

## Decision journal

**Who generates the API key? The scraper client, or the PTTH relay server?**

The precedent from big cloud vendors seems to be that the server generates tokens. This is probably to avoid a situation where clients with vulnerable crypto code, or just bad code, generate low-entropy keys. By putting that responsibility on the server, the server can enforce high-entropy keys.

**Should the key rotate? If so, how?**

The key should _at least_ expire. If it expires every 30 or 90 days, then a human is slightly inconvenienced into servicing their scraper regularly.

When adding other features, we must consider the use cases:

1. A really dumb Bash script that shells out to curl
2. A Python script
3. A sophisticated desktop app in C#, Rust, or C++
4. Eventually replacing the fixed API keys used in ptth_server

For the Bash script, rotation will probably be difficult, and I'm okay if our support for that is merely "It'll work for 30 days at a time, then you need to rotate keys manually."

For the Python script, rotation could be automated, but cryptography is still probably difficult. I think some AWS services require actual crypto keys, and not just high-entropy password keys.

For the sophisticated desktop app, cryptography is on the table, but this is the least likely use case to ever happen, too.
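Going back to the first question, here is a rough sketch of what server-side key creation could look like, pairing the "Generate entropy in app code" comment with the Create key SQL above. This is not the ptth implementation: the `create_scraper_key` name is made up, the `rand`, `sha2`, and `rusqlite` crates and the SHA-256 hashing of the key are assumptions, and the placeholders are renumbered to rusqlite's `?1`..`?3` style.

```
// Sketch only, not the ptth implementation. The relay generates the key,
// stores only its hash, and returns the plaintext to the client exactly once.
// Assumed crates: rand, sha2, rusqlite.

use rand::RngCore;
use sha2::{Digest, Sha256};

fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

fn create_scraper_key(db: &rusqlite::Connection, name: &str, email: &str)
    -> rusqlite::Result<String>
{
    // Server-side entropy, so clients can't pick weak keys (see decision journal)
    let mut raw = [0u8; 32];
    rand::thread_rng().fill_bytes(&mut raw);
    let plaintext_key = hex(&raw);

    // Store only a hash of the key; 2592000 seconds = 30 days, as in the SQL above
    let hash = hex(Sha256::digest(plaintext_key.as_bytes()).as_slice());
    db.execute(
        "insert into scraper_keys (hash, not_before, not_after, name, email) \
         values (?1, strftime ('%s'), strftime ('%s') + 2592000, ?2, ?3);",
        rusqlite::params![hash, name, email],
    )?;

    // Respond to client with the plaintext key, then forget it
    Ok(plaintext_key)
}
```

Because the key is server-generated and high-entropy, a plain cryptographic hash is enough for storage; a password-style KDF would only matter if clients could choose weak keys.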