# Auth route for scrapers

(Find this issue with `git grep YNQAQKJS`)

## Test curl commands

Put your API key into a header file, like this:

```
X-ApiKey: bad_password
```

Call it "scraper-secret.txt" or something else obviously secret.
Don't check it into Git. For now, the key expires every 30 days and must
be rotated manually.

Export the scraper API's URL prefix to an environment variable:

`export API=http://127.0.0.1:4000/scraper`

Newer versions of curl (7.55.0 and later) can load headers from a text file.
All commands below use this feature to load the API key.

`curl --header @scraper-secret.txt $API/api/test`

Should return "You're valid!"

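A Python scraper can reuse the same header file instead of shelling out to curl. This is a minimal sketch, assuming the file holds one `Name: value` header per line as shown above. `load_header_file` and `build_request` are hypothetical helper names, not part of PTTH.

```python
from urllib.request import Request


def load_header_file(path):
    """Parse a curl-style header file ("Name: value" per line) into a dict."""
    headers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            name, sep, value = line.partition(":")
            if not sep:
                continue  # Not a header line; ignore it
            headers[name.strip()] = value.strip()
    return headers


def build_request(api_prefix, path, headers):
    """Build (but do not send) a GET request carrying the secret headers."""
    # urllib normalizes header-name capitalization; HTTP header names are
    # case-insensitive, so the server should not care.
    return Request(api_prefix + path, headers=headers)
```

Passing the result to `urllib.request.urlopen` sends the request, e.g. `urlopen(build_request("http://127.0.0.1:4000/scraper", "/api/test", load_header_file("scraper-secret.txt")))`.
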
`curl --header @scraper-secret.txt $API/v1/server_list`

Should return a JSON object listing all the servers.

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/api/v1/dir/`

Proxies into the "aliens_wildland" server and retrieves a JSON object listing
the file server root. (The server must be running a new version of ptth_server
which can serve the JSON API)

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/api/v1/dir/src/`

Same, but retrieves the listing for "/src".

`curl --header @scraper-secret.txt $API/v1/server/aliens_wildland/files/src/tests.rs`

There is no special API for retrieving files yet, but the existing server
API is proxied through the new scraper API on the relay.

`curl --head --header @scraper-secret.txt $API/v1/server/aliens_wildland/files/src/tests.rs`

PTTH supports HEAD requests. This request will yield a "204 No Content", with
the "content-length" header.

`curl --header @scraper-secret.txt -H "range: bytes=100-199" $API/v1/server/aliens_wildland/files/src/tests.rs`

PTTH supports byte range requests. This request will skip 100 bytes into the
file, and read 100 bytes.

To avoid fence-post errors, most programming languages use half-open ranges,
e.g. `0..3` means "0, 1, 2". However, HTTP byte ranges are closed ranges,
e.g. `0-3` means "0, 1, 2, 3". So 100-199 means 199 is the last byte retrieved.

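The off-by-one is easy to get wrong in client code, so here is a small Python sketch of the conversion. `http_range_header` is a hypothetical helper name. It takes the half-open range most languages use and emits the closed range HTTP expects.

```python
def http_range_header(start, stop):
    """Convert a half-open byte range [start, stop) to a Range header value."""
    if start < 0 or stop <= start:
        raise ValueError("need 0 <= start < stop")
    # HTTP ranges are closed, so the last byte requested is stop - 1
    return f"bytes={start}-{stop - 1}"


# "Skip 100 bytes, then read 100 bytes" is [100, 200) in half-open terms
print(http_range_header(100, 200))  # prints "bytes=100-199"
```
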
By polling with HEAD and byte range requests, a scraper client can approximate
`tail -f` behavior of a server-side file.

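One polling step of that `tail -f` loop could look roughly like this in Python. It is a sketch under assumptions: the file is append-only, `head` is a caller-supplied function returning the remote content-length, and `get_range` is a caller-supplied function that performs the ranged GET. All the names here are hypothetical, and the network calls are left to the caller so the logic can be tested offline.

```python
def next_tail_range(local_len, remote_len):
    """Return the Range header value for bytes not yet seen, or None."""
    if remote_len <= local_len:
        return None  # Nothing new (or the file shrank, which we ignore here)
    # Closed range: from the first unseen byte to the last byte of the file
    return f"bytes={local_len}-{remote_len - 1}"


def tail_step(buffer, head, get_range):
    """Run one poll: ask for the length, then fetch only the new bytes."""
    header = next_tail_range(len(buffer), head())
    if header is None:
        return buffer
    return buffer + get_range(header)
```

A real client would call `tail_step` in a loop with a sleep between polls, with `head` and `get_range` wrapping the HEAD and ranged GET requests shown earlier.
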
`wget --continue --execute robots=off --no-parent --recursive --header "$(<scraper-secret.txt)" $API/v1/server/aliens_wildland/files/crates/`

Use wget's recursive spidering to download all the files in a folder.
The human-friendly HTML interface is exposed through the scraper
API, so this will also download the HTML directory listings.

- `--continue` uses the server's content-length header to skip over
  files that are already fully downloaded to local disk. Partial
  downloads will be resumed where they left off, which is fine
  for long-running log files that may append new data but not
  modify old data.
- `--execute robots=off` disables wget's handling of robots.txt.
  We know we're a robot, the server doesn't care, it's fine.
- `--no-parent` prevents the `../` links from accidentally causing
  infinite recursion.
- `--recursive` causes wget to recurse into individual files, and
  into subdirectories.
- `--header "$(<scraper-secret.txt)"` tells Bash to load the
  secret API key from disk and send it to wget. The secret will
  leak into the process list, but at least it won't leak into
  your bash_history file.

## Problem statement

PTTH has 2 auth routes:

- A fixed API key for servers
- Whatever the end user puts in front of the HTML client

"Whatever" is hard for scrapers to deal with. This barrier to scraping
is blocking these issues:

- EOTPXGR3 Remote `tail -f`
- UPAQ3ZPT Audit logging of the relay itself
- YMFMSV2R Add Prometheus metrics

## Proposal

Add a 3rd auth route meeting these criteria:

- Enabled by a feature flag, disabled by default
- Bootstrapped by the user-friendly HTML frontend
- Suitable for headless automated scrapers

It will probably involve an API key like the servers use. Public-key
crypto is stronger, but involves more work. I think we should plan to
start with something weak, and also plan to deprecate it once something
stronger is ready.

## Proposed impl plan

- (X) Add feature flags to ptth_relay.toml for dev mode and scrapers
- (X) Make sure Docker release CAN build
- (X) Add hash of 1 scraper key to ptth_relay.toml, with 1 week expiration
- (X) Accept scraper key for some testing endpoint
- (X) (POC) Test with curl
- (X) Clean up scraper endpoint
- (X) Add (almost) end-to-end tests for test scraper endpoint
- (X) Thread server endpoints through relay scraper auth
- (don't care) Add tests for other scraper endpoints
- (don't care) Factor v1 API into v1 module
- (X) Add real scraper endpoints
- ( ) Manually create SQLite DB for scraper keys, add 1 hash
- ( ) Impl DB reads
- ( ) Remove scraper key from config file
- ( ) Make sure `cargo test` passes and Docker CAN build
- ( ) (MVP) Test with curl
- ( ) Impl and test DB init / migration
- ( ) Impl DB writes (Add / revoke keys) as CLI commands
- ( ) Implement API (Behind X-Email auth) for that, test with curl
- ( ) Set up mitmproxy or something to add X-Email header in dev env
- ( ) Implement web UI (Behind X-Email)

POC is the proof-of-concept - At this point we will know that in theory the
feature can work.

MVP is the first deployable version - I could put it in prod, manually fudge
the SQLite DB to add a 1-month key, and let people start building scrapers.

Details:

Dev mode will allow anonymous users to generate scraper keys. In prod mode
(the default), clients will need to have the X-Email header set or use a
scraper key to do anything.

Design the DB so that the servers can share it one day.

Design the API so that new types of auth / keys can be added one day, and
the old ones deprecated.

Endpoints needed:

- (X) Query server list
- (X) Query directory in server
- (not needed) GET file with byte range (identical to frontend file API)

These will all be JSON for now since Python, Rust, C++, C#, etc. can handle it.
For compatibility with wget spidering, I _might_ do XML or HTML that's
machine-readable. We'll see.

## DB / UI impl

Sprint 1:

- Look up keys by their hash
- not_before
- not_after
- name
- X-Email associated with key

Sprint 2:

- UI to generate / revoke keys

## SQL schema

Migration

```
create table scraper_keys (
    hash text primary key,      -- Using blake3 for this because it's not a password
    not_before integer not null, -- Seconds since epoch
    not_after integer not null,  -- Seconds since epoch
    name text not null,          -- Human-friendly nickname
    email text not null          -- Email address that created the key
);
```

Look up hash

```
select not_before, not_after, name, email
from scraper_keys
where
    hash = $1 and
    strftime ('%s') >= not_before and
    strftime ('%s') < not_after
;
```

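To sanity-check the lookup, here is a sketch that runs it against an in-memory SQLite DB from Python. Assumptions, all mine: `hashlib.blake2b` is only a stand-in because blake3 is not in the Python standard library, the current time is passed in from app code instead of using `strftime` so the expiry edges can be tested, and `hash_key` / `look_up_key` are hypothetical names.

```python
import hashlib
import sqlite3
import time

SCHEMA = """
create table scraper_keys (
    hash text primary key,
    not_before integer not null,
    not_after integer not null,
    name text not null,
    email text not null
);
"""


def hash_key(plaintext_key):
    # Stand-in hash (assumption): the schema calls for blake3
    return hashlib.blake2b(plaintext_key.encode("utf-8")).hexdigest()


def look_up_key(conn, plaintext_key, now=None):
    """Return (not_before, not_after, name, email) if the key is valid now."""
    now = int(time.time()) if now is None else now
    return conn.execute(
        "select not_before, not_after, name, email "
        "from scraper_keys "
        "where hash = ? and ? >= not_before and ? < not_after",
        (hash_key(plaintext_key), now, now),
    ).fetchone()
```

The validity window is half-open, matching the query above: a key is valid at `not_before` and already expired at `not_after`.
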
Create key

```
-- Generate entropy in app code
insert into scraper_keys (
    hash,
    not_before,
    not_after,
    name,
    email
) values (
    $1,
    strftime ('%s'),
    strftime ('%s') + 2592000, -- 30 days, in seconds
    $2,
    $3
);

-- Respond to client with plaintext key and then forget it.
-- If a network blip causes the key to evaporate, the client should revoke it.
```

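The "generate entropy in app code" step might look like this in Python. This is a sketch, not the relay's implementation: `create_key` is a hypothetical name, `hashlib.blake2b` again stands in for blake3, and 32 random bytes is my assumption for "high-entropy". Only the hash is stored, and the plaintext is returned once to the caller.

```python
import hashlib
import secrets
import sqlite3
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # 2592000 seconds, as in the SQL above


def create_key(conn: sqlite3.Connection, name: str, email: str, now=None) -> str:
    """Insert a new 30-day key and return its plaintext exactly once."""
    now = int(time.time()) if now is None else now
    # Server-side entropy: 32 random bytes, URL-safe base64 encoded
    plaintext = secrets.token_urlsafe(32)
    # Stand-in hash (assumption): the schema calls for blake3
    key_hash = hashlib.blake2b(plaintext.encode("utf-8")).hexdigest()
    conn.execute(
        "insert into scraper_keys (hash, not_before, not_after, name, email) "
        "values (?, ?, ?, ?, ?)",
        (key_hash, now, now + THIRTY_DAYS, name, email),
    )
    conn.commit()
    return plaintext
```
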
Revoke key

```

```

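The revoke query is still unwritten. One possible shape, sketched in Python under the assumption that revoking means deleting the row (setting `not_after` to the current time would be another option, and would keep the row around for auditing). `revoke_key` is a hypothetical name.

```python
import sqlite3


def revoke_key(conn: sqlite3.Connection, key_hash: str) -> bool:
    """Delete the key with this hash. Returns True if a row was removed."""
    cur = conn.execute("delete from scraper_keys where hash = ?", (key_hash,))
    conn.commit()
    return cur.rowcount == 1
```
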
## Decision journal

**Who generates the API key? The scraper client, or the PTTH relay server?**

The precedent from big cloud vendors seems to be that the server generates
tokens. This is probably to avoid a situation where clients with vulnerable
crypto code or just bad code generate low-entropy keys. By putting that
responsibility on the server, the server can enforce high-entropy keys.

**Should the key rotate? If so, how?**

The key should _at least_ expire. If it expires every 30 or 90 days, then a
human is slightly inconvenienced by having to service their scraper regularly.

When adding other features, we must consider the use cases:

1. A really dumb Bash script that shells out to curl
2. A Python script
3. A sophisticated desktop app in C#, Rust, or C++
4. Eventually replacing the fixed API keys used in ptth_server

For the Bash script, rotation will probably be difficult, and I'm okay if
our support for that is merely "It'll work for 30 days at a time, then you
need to rotate keys manually."

For the Python script, rotation could be automated, but cryptography is
still probably difficult. I think some AWS services require actual crypto
keys, and not just high-entropy password keys.

For the sophisticated desktop app, cryptography is on the table, but this
is the least likely use case to ever happen, too.